Abstract

We study the problem of learning to rank from pairwise preferences, and solve a long-standing
open problem that has led to the development of many heuristics but no provable results for our
particular problem.

The setting is as follows: We are given a set V of n elements from some universe, and we
wish to linearly order them given pairwise preference labels. Given two elements $u, v \in V$, a
pairwise preference label is obtained as a response, typically from a human, to the question
"which is preferred, u or v?" We assume no abstention; hence, either u is preferred to v (denoted
$u \prec v$) or the other way around. We also allow non-transitivity paradoxes, which may
arise naturally due to human mistakes or irrationality.

The goal is to linearly order the elements from the most preferred to the least preferred, while
disagreeing with as few pairwise preference labels as possible. Our performance is measured
by two parameters: the loss (number of pairwise preference labels we disagree with) and the
query complexity (number of pairwise preference labels we obtain). This is a typical learning
problem, with the exception that the space from which the pairwise preferences are drawn is finite,
consisting of $\binom{n}{2}$ possibilities only. Our algorithm reduces this problem to another problem, for
which any standard learning black-box can be used. The advantage of the reduced problem
over the original one is that never more than $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$ labels are
needed (including the query complexity of the reduction) in order to obtain the same risk that the
same black-box would have incurred given access to all $\binom{n}{2}$ labels in the original problem,
up to a multiplicative regret of $(1 + \varepsilon)$. The label sampling is adaptive; hence, viewing our
algorithm as a preconditioner for a learning black-box, we arrive at an active learning algorithm
with provable, almost optimal bounds. We also show that VC arguments give significantly worse
query complexity bounds for the same regret in a non-adaptive sampling strategy.

Our main result settles an open problem posed by learning-to-rank theoreticians and
practitioners: What is a provably correct way to sample preference labels?

To further show the power and practicality of our solution, we analyze a typical test case 
in which the learning black-box preconditioned by our algorithm is a regularized large margin 
linear classifier. 
1 Introduction



We study the problem of learning to rank from pairwise preferences, and solve a long-standing open
problem that has led to the development of many heuristics but no provable results.

The setting is as follows: We are given a set V of n elements from some universe, and we wish
to linearly order them given pairwise preference labels. Given two elements $u, v \in V$, a pairwise
preference label is obtained as a response, typically from a human, to the question "which is preferred,
u or v?" We assume no abstention; hence, either u is preferred to v (denoted $u \prec v$) or the other
way around.

The goal is to linearly order the elements from the most preferred to the least preferred, while
disagreeing with as few pairwise preference labels as possible. Our performance is measured by
two parameters: the loss (number of pairwise preference labels we disagree with) and the query
complexity (number of pairwise preference labels we obtain). This is a typical learning problem,
with the exception that the sample space is finite, consisting of $\binom{n}{2}$ possibilities only.

The loss minimization problem given the entire $n \times n$ preference matrix is a well known
NP-hard problem called MFAST (minimum feedback arc-set in tournaments) [7]. Recently, Kenyon
and Schudy [18] have devised a PTAS for it, namely, a polynomial (in n) time algorithm computing
a solution with loss at most $(1 + \varepsilon)$ times the optimal, for any $\varepsilon > 0$ (the degree of the
polynomial may depend on $\varepsilon$). In our case each edge from the input graph is given for a unit cost.
Our main algorithm is derived from Kenyon et al.'s algorithm. Our output, however, is not a solution to
MFAST, but rather a reduction of the original learning problem to a different, simpler one. The
reduced problem can be solved using any general ERM (empirical risk minimization) black-box.
The sampling of preference labels from the original problem is adaptive; hence the combination
of our algorithm and any ERM black-box is an active learning one. We give examples with an
SVM based ERM black-box toward the end, and show that our approach gives rise to a reduced
SVM problem which provably approximates the original problem to within any arbitrarily small
error relative to the original SVM optimal solution. The total number of pairwise preference labels
acquired in the reduction and in the construction of the reduced SVM is significantly smaller than
what a VC-dimension type argument would guarantee.

Our setting differs from much of the learning to rank (LTR) literature. Usually, the labels used
in LTR problems are responses to individual elements, and not to pairs of elements. A typical
example is the 1..5 scale rating for restaurants, or 0,1 rating (irrelevant/relevant) for candidate
documents retrieved for a query (known as the binary ranking problem). The goal there is, as in
ours, to order the elements while disagreeing with as few pairwise relations as possible, where
a pairwise relation is derived from any two elements rated differently. Note that the underlying
preference graph there is transitive, hence no combinatorial problem arises due to nontransitivity. In fact,
some view the rating setting as an ordinal regression problem and not a ranking problem. Here
the preference graph may contain cycles, and is hence agnostic with respect to the concept class
we are allowed to output from, namely, permutations. We note that some LTR literature does
consider the pairwise preference label approach, and there is much justification to it (see [U [15] and 
references therein). As far as we know, our work provides a sound solution to a problem addressed
by machine learning practitioners (e.g. [8]) who use pairwise preferences as labels for the task of
learning to rank items, but wish to avoid obtaining labels for the quadratically many preference
pairs, without compromising low error bounds. We also show that the fear of quadraticity found in
much work dealing with pairwise preference based learning to rank (e.g., from Crammer et al. [10]:
"the [pairwise] approach is time consuming since it requires increasing the sample size ... to $O(n^2)$")
is unfounded in the light of new advances in combinatorial optimization [IJ [18].

It is important to note a significant difference between our work and Kenyon and Schudy's
PTAS [18], which is also the difference between the combinatorial optimization problem and the
learning counterpart. A good way to explain this is to compare two learners, Alice and Bob. On
the first day, Bob queries all pairwise preference labels and sends them to a perfect solver for
MFAST. Alice uses our work to query only $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$ preference labels and obtains a
decomposition of the original input V into an ordered list of sub-problems $V_1, \dots, V_k$, where each
$V_i$ is contained in V. Using the same optimizer for each part and concatenating the individual
output permutations, Alice will incur a loss of at most $(1 + \varepsilon)$ times that of Bob. So far Alice might
not gain much, because the decomposition may consist of a single block, hence no reduction.

The next day, Bob realizes that his MFAST solver cannot deal with large inputs because he is trying
to solve an NP-Hard problem. Also, he seeks a multiplicative regret of $(1 + \varepsilon)$ with respect to the
optimal solution (we also say a relative regret of $\varepsilon$), and his sought $\varepsilon$ is too small to use the PTAS
(see footnote 1). To remedy this, he takes advantage of the fact that the set V does not merely consist
of abstract elements, but rather each $u \in V$ is endowed with a feature vector $\varphi(u)$, and hence each pair of
points u, v is endowed with the combined feature vector $(\varphi(u), \varphi(v))$. As in typical learning, he
posits that the order relation between u, v can be deduced from a linear function of $(\varphi(u), \varphi(v))$,
and invokes an optimizer (e.g. SVM) on the relaxed problem, with all pairs as input. Note that
Bob may try to sample pairs uniformly to reduce the query complexity (and, perhaps, the running
time of the relaxed solver), but as we show below, he will be discouraged from doing so because
in certain realistic cases a relative regret of $\varepsilon$ may entail sampling the entire pairwise preference
space.

Alice uses the same relaxed optimizer, say, SVM. The labels she sends to the solver consist
of a uniform sample of pairs from each block $V_i$, together with all pairs u, v residing in separate
blocks from her aforementioned decomposition. From the former label type she would
need only $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$ many, because (per our decomposition design) within the blocks the
cost of any solution is high, and hence a relative error is tantamount to an absolute error of similar
magnitude, for which simple VC bounds allow low query complexity. From the latter label type,
she would generate a label for all pairs u, v in distinct $V_i, V_j$, using a "made up" label corresponding
to the order of $V_i, V_j$ (recall that the decomposition is ordered). Since both Bob and Alice used
SVM with the same feature vectors (and the same regularization), there is no reason to believe
that the additional cost incurred by the relaxation inaccuracies would hurt either Bob or Alice
more than the other. The same statement applies to any relaxation (e.g. decision trees), though
we will make a quantitative statement for the case of large margin linear classifiers below.

Among other changes to Kenyon and Schudy's algorithm, a key technique is to convert a
highly sensitive greedy improvement step into a robust approximate one, by careful sampling.
The main difficulty stems from the fact that after a single greedy improvement step, the sample
becomes stale and requires refreshing. We show a query efficient refreshing technique that allows
iterated approximate greedy improvement steps. Interestingly, their original analysis is amenable
to this change. It is also interesting to note that the sampling scheme used for identifying greedy
improvement steps for a current solution is similar to ideas used by Ailon et al. [HE] and Halevy
et al. [13] in the context of property testing and reconstruction, where elements are sampled from
exponentially growing intervals in a linear order.

1 The running time of the PTAS is exponential in $\varepsilon^{-1}$.

Ailon et al.'s 3-approximation algorithm for MFAST using QuickSort PQ is used in Kenyon et
al. [18] as well as here as an initialization step. Note that this is a sublinear algorithm. In fact, it
samples only $O(n \log n)$ pairs from the $\binom{n}{2}$ possible, in expectation. Note also that the pairs from
which we query the preference relation in QuickSort are chosen adaptively.
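The adaptive nature of QuickSort's queries can be illustrated with a minimal Python sketch. It is not the paper's Algorithm; the function name `quicksort_rank`, the oracle signature `prefers(u, v)` and the `queries` log are assumptions made for illustration only.

```python
import random

def quicksort_rank(items, prefers, rng, queries):
    """Order `items` by randomized QuickSort over a pairwise preference oracle.

    prefers(u, v) -> True if u is preferred to v (one label query).
    Every queried pair is appended to `queries`, so len(queries) is the query cost.
    """
    if len(items) <= 1:
        return list(items)
    k = rng.randrange(len(items))          # random pivot position
    pivot = items[k]
    left, right = [], []
    for j, u in enumerate(items):
        if j == k:
            continue
        queries.append((u, pivot))         # adaptive: depends on earlier pivots
        (left if prefers(u, pivot) else right).append(u)
    return (quicksort_rank(left, prefers, rng, queries)
            + [pivot]
            + quicksort_rank(right, prefers, rng, queries))


# Toy usage with a consistent (transitive) oracle: smaller number = preferred.
rng = random.Random(0)
items = list(range(50))
rng.shuffle(items)
queries = []
ranking = quicksort_rank(items, lambda u, v: u < v, rng, queries)
```

Each pair is queried at most once, and which pairs are queried depends on the pivots chosen so far; with a non-transitive oracle the same code still returns a permutation, just not one consistent with every label.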

2 Notation and Basic Lemmata 
2.1 The Learning Theoretical Problem 

Let V denote a finite set that we wish to rank. In a more general setting we are given a sequence
$V^1, V^2, \dots$ of sets, but there is enough structure and interest in the single set case, which we focus
on in this work. Denote by n the cardinality of V. We assume there is an underlying preference
function W on pairs of elements in V, which is unknown to us. For any ordered pair $u, v \in V$,
the preference value $W(u, v)$ takes the value 1 if u is deemed preferred over v, and 0 otherwise.
We enforce $W(u,v) + W(v,u) = 1$; hence, $(V, W)$ is a tournament. We assume that W is agnostic
in the sense that it does not necessarily encode a transitive preference function, and may contain
errors and inconsistencies. For convenience, for any two real numbers a, b we will let $[a, b]$ denote
the interval $\{x : a \le x \le b\}$ if $a \le b$ and $\{x : b \le x \le a\}$ otherwise.

Assume now that we wish to predict W using a hypothesis h from some concept class $\mathcal{H}$. The
hypothesis h will take an ordered pair $(u, v)$ of elements of V as input, and will output a label of 1 to assert
that u precedes v and 0 otherwise. We want $\mathcal{H}$ to contain only consistent hypotheses, satisfying
transitivity (i.e. if $h(u,v) = h(v,w) = 1$ then $h(u,w) = 1$). A typical way to do this is using a
linear score function: Each $u \in V$ is endowed with a feature vector $\varphi(u)$ in some RKHS $H$, a weight
vector $w \in H$ is used for parametrizing each $h_w \in \mathcal{H}$, and the prediction is as follows (see footnote 2):



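$$h_w(u,v) = \begin{cases} 1 & \langle w, \varphi(u)\rangle > \langle w, \varphi(v)\rangle \\ 0 & \langle w, \varphi(u)\rangle < \langle w, \varphi(v)\rangle \\ \mathbf{1}_{u<v} & \text{otherwise} \end{cases}$$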
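As a minimal sketch of the linear-score hypothesis above, the following Python function evaluates $h_w$ for finite-dimensional feature vectors. The dictionary `phi`, the tuple `w` and the use of integer element ids for the tie-breaking order are illustrative assumptions, not part of the paper.

```python
def h_w(w, phi, u, v):
    """Return 1 if the linear score ranks u before v, and 0 otherwise.

    Ties in the score are broken by the fixed arbitrary order u < v on
    element ids, which keeps the hypothesis transitive.
    """
    score_u = sum(a * b for a, b in zip(w, phi[u]))
    score_v = sum(a * b for a, b in zip(w, phi[v]))
    if score_u > score_v:
        return 1
    if score_u < score_v:
        return 0
    return 1 if u < v else 0


# Toy feature map (hypothetical): elements 0 and 2 have identical scores.
phi = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 0.0)}
w = (2.0, 1.0)
```

Because the prediction is a comparison of real scores with a consistent tie-break, the induced relation is always a linear order, which is why such hypotheses lie in the set of permutations.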



Our work is relevant, however, to nonlinear hypothesis classes as well. We denote by $\Pi(V)$ the set of
permutations on the set V; hence we always assume $\mathcal{H} \subseteq \Pi(V)$. (Permutations $\pi$ are naturally
viewed as binary classifiers of pairs of elements via the preference predicate: The notation is
$\pi(u,v) = 1$ if and only if $u \prec_\pi v$, namely, if u precedes v in $\pi$. Slightly abusing notation, we also
view permutations as injective functions from $[n]$ to V, so that the element $\pi(1) \in V$ is in the first,
most preferred position and $\pi(n)$ is the least preferred one. We also define the function $\rho_\pi$ inverse
to $\pi$ as the unique function satisfying $\pi(\rho_\pi(v)) = v$ for all $v \in V$. Hence, $u \prec_\pi v$ is equivalent to
$\rho_\pi(u) < \rho_\pi(v)$.)

As in the standard ERM setting, we assume a non-negative risk function $C_{u,v}$ penalizing the error
of h with respect to the pair u, v, namely,

$$C_{u,v}(h,V,W) = \mathbf{1}_{h(u,v) \ne W(u,v)} .$$

The total loss, $C(h,V,W)$, is defined as $C_{u,v}$ summed over all unordered pairs $u, v \in V$. Our goal is to
devise an active learning algorithm for the purpose of minimizing this loss.

In this paper we find an almost optimal solution to the problem using important breakthroughs
in combinatorial optimization of a related problem called minimum feedback arc-set in tournaments
(MFAST). The relation between this NP-Hard problem and our learning problem has been noted
before [9], but no provable almost optimal active learning has been devised, as far as we know.

2 We assume that V is endowed with an arbitrary linear order relation, so we can formally write u < v to arbitrarily
yet consistently break ties.

2.2 The Combinatorial Optimization Counterpart 

MFAST is defined as follows: Assume we are given V and W in its entirety; in other words, we
pay no price for reading W. The goal is to order the elements of V in a full linear order, while
minimizing the total pairwise violation. More precisely, we wish to find a permutation $\pi$ on the
elements of V such that the total backward cost

$$C(\pi,V,W) = \sum_{u \prec_\pi v} W(v,u) \tag{2.1}$$

is minimized. The expression in (2.1) will be referred to as the MFAST cost henceforth.
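The MFAST cost in (2.1) can be sketched in a few lines of Python. The representation of W as a dictionary keyed by ordered pairs and the three-element cyclic tournament are assumptions chosen for illustration.

```python
from itertools import permutations

def mfast_cost(perm, W):
    """Sum of W[(v, u)] over all pairs where u precedes v in perm."""
    cost = 0
    for a in range(len(perm)):
        for b in range(a + 1, len(perm)):
            u, v = perm[a], perm[b]      # u precedes v in perm
            cost += W[(v, u)]            # pay 1 if W actually prefers v to u
    return cost


# A non-transitive (cyclic) tournament: a beats b, b beats c, c beats a.
W = {("a", "b"): 1, ("b", "a"): 0,
     ("b", "c"): 1, ("c", "b"): 0,
     ("c", "a"): 1, ("a", "c"): 0}
best = min(mfast_cost(p, W) for p in permutations("abc"))
```

On this cyclic input no permutation reaches cost 0, which is the agnostic situation the text describes: the optimum over permutations must disagree with at least one label.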

When W is given as input, this problem is known as the minimum feedback arc-set in
tournaments (MFAST). A PTAS has been discovered for this NP-Hard problem very recently [18]. Though a
major theoretical achievement from a combinatorial optimization point of view, the PTAS is not
useful for the purpose of learning to rank from pairwise preferences because it is not query efficient.
Indeed, it may require in some cases reading all quadratically many entries in W. In this work
we fix this drawback, while using their main ideas for the purpose of machine learning to rank.
We are not interested in MFAST per se, but use the algorithm in [18] to obtain a certain useful
decomposition of the input $(V, W)$ from which our main active learning result easily follows.

Definition 2.1. Given a set V of size n, an ordered decomposition is a list of pairwise disjoint
subsets $V_1, \dots, V_k \subseteq V$ such that $\bigcup_{i=1}^{k} V_i = V$. For a given decomposition, we let $W|_{V_i}$ denote the
restriction of W to $V_i \times V_i$ for $i = 1, \dots, k$. Similarly, for a permutation $\pi \in \Pi(V)$ we let $\pi|_{V_i}$
denote the restriction of the permutation to the elements of $V_i$ (hence, $\pi|_{V_i} \in \Pi(V_i)$). We say
that $\pi \in \Pi(V)$ respects $V_1, \dots, V_k$ if for all $u \in V_i$, $v \in V_j$, $i < j$, we have $u \prec_\pi v$. We denote the set of
permutations $\pi \in \Pi(V)$ respecting the decomposition $V_1, \dots, V_k$ by $\Pi(V_1, \dots, V_k)$. We say that
a subset U of V is small in V if $|U| \le \log n / \log\log n$; otherwise we say that U is big in V. A
decomposition $V_1, \dots, V_k$ is $\varepsilon$-good with respect to W if (see footnote 3):

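• Local chaos:

$$\min_{\pi \in \Pi(V)} \sum_{i:\, V_i \text{ big in } V} C(\pi|_{V_i}, V_i, W|_{V_i}) \;\ge\; \varepsilon^2 \sum_{i:\, V_i \text{ big in } V} \binom{|V_i|}{2} . \tag{2.2}$$

• Approximate optimality:

$$\min_{\sigma \in \Pi(V_1,\dots,V_k)} C(\sigma, V, W) \;\le\; (1 + \varepsilon) \min_{\pi \in \Pi(V)} C(\pi, V, W) . \tag{2.3}$$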



Intuitively, an $\varepsilon$-good decomposition identifies a block-ranking of the data that is difficult to
rank in accordance with W internally on average among big blocks (local chaos), yet possible to
rank almost optimally while respecting the decomposition (approximate optimality). We show how
to take advantage of an $\varepsilon$-good decomposition for learning in Section 2.4. The ultimate goal will
be to find an $\varepsilon$-good decomposition of the input set V using $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$ queries into W.
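The combinatorial parts of the ordered-decomposition definition (respecting a decomposition, small versus big blocks) can be sketched as follows. The helper names `respects` and `is_small`, and the choice of base-2 logarithms, are assumptions; the text does not fix the base of the logarithm.

```python
import math

def respects(perm, blocks):
    """True if every element of block i precedes every element of block j, i < j."""
    block_of = {v: i for i, block in enumerate(blocks) for v in block}
    idx = [block_of[v] for v in perm]
    # Block indices must be non-decreasing along the permutation.
    return all(idx[k] <= idx[k + 1] for k in range(len(idx) - 1))

def is_small(block_size, n):
    """Small means size <= log n / log log n (base 2 assumed; needs n >= 4)."""
    return block_size <= math.log2(n) / math.log2(math.log2(n))


blocks = [{"a", "b"}, {"c", "d"}]
```

For example, with n = 1024 the threshold is roughly 3.01 under the base-2 assumption, so blocks of up to three elements count as small and can be ranked by exhaustive search.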

3 We will just say $\varepsilon$-good if W is clear from the context.
2.3 Basic Results from Statistical Learning Theory

In statistical learning theory, one seeks to find a classifier minimizing an expected cost incurred on a
random input by minimizing the empirical cost on a sample thereof. If we view pairs of elements in
V as data points, then the MFAST cost can be cast, up to normalization, as an expected cost over
a random draw of a data point. Recall our notation of $\pi(u, v)$ denoting the indicator function for
the predicate $u \prec_\pi v$. Thus $\pi$ is viewed as a binary hypothesis function over $\binom{V}{2}$, and $\Pi(V)$ can be
viewed as the concept class of all binary hypotheses satisfying transitivity: $\pi(u, v) + \pi(v, y) \ge \pi(u, y)$
for all u, v, y.

A sample E of unordered pairs gives rise to a partial cost, $C_E$, defined as follows:

Definition 2.2. Let $(V,E)$ denote an undirected graph over V, which may contain parallel edges
(E is a multi-set). The partial MFAST cost $C_E(\pi)$ is defined as

$$C_E(\pi,V,W) = \binom{n}{2} |E|^{-1} \sum_{\substack{(u,v) \in E \\ u \prec_\pi v}} W(v,u) .$$

(The accounting of parallel edges in E is clear.) The function $C_E(\cdot,\cdot,\cdot)$ can be viewed as an
empirical unbiased estimator of $C(\pi,V,W)$ if $E \subseteq \binom{V}{2}$ is chosen uniformly at random among all
(multi) subsets of a given size.
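A minimal sketch of the partial cost, assuming W is a dictionary keyed by ordered pairs and E is a list of unordered pairs (repetitions allowed). The function name `partial_cost` is hypothetical.

```python
from itertools import combinations
from math import comb

def partial_cost(perm, W, E):
    """Rescaled number of sampled pairs on which perm disagrees with W."""
    pos = {v: k for k, v in enumerate(perm)}
    disagreements = 0
    for u, v in E:
        # Orient the sampled pair so that `first` precedes `second` in perm.
        first, second = (u, v) if pos[u] < pos[v] else (v, u)
        disagreements += W[(second, first)]
    return comb(len(perm), 2) * disagreements / len(E)


# Cyclic tournament: a beats b, b beats c, c beats a.
W = {("a", "b"): 1, ("b", "a"): 0,
     ("b", "c"): 1, ("c", "b"): 0,
     ("c", "a"): 1, ("a", "c"): 0}
perm = ["a", "b", "c"]
all_pairs = list(combinations(perm, 2))
```

When E contains every pair exactly once, the rescaling factor cancels and the partial cost equals the full MFAST cost; a single sampled pair gives a much coarser estimate.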

The basic question in statistical learning theory is: how good is the minimizer $\pi$ of $C_E$, in terms
of $C$? The notion of VC dimension [19] gives us a nontrivial bound which is, albeit suboptimal (as
we shall soon see), a good start for our purpose.

Lemma 2.3. The VC dimension of the set of permutations on V, viewed as binary classifiers on
pairs of elements, is $n - 1$.

It is easy to show that the VC dimension is at most $O(n \log n)$. Indeed, the number of
permutations is at most $n!$, and the VC dimension is always bounded by the log of the concept class
cardinality. That the bound is linear was proven in [6]. We present the proof here in Appendix A
for completeness. The implications of the VC bound are as follows.

Proposition 2.4. Assume E is chosen uniformly at random (with repetitions) as a sample of
m elements from $\binom{V}{2}$, where $m > n$. Then with probability at least $1 - \delta$ over the sample, all
permutations $\pi$ satisfy:

$$|C_E(\pi, V, W) - C(\pi, V, W)| = n^2 \cdot O\!\left(\sqrt{\frac{n \log m + \log(1/\delta)}{m}}\right) .$$

The consequences of Proposition 2.4 are as follows: If we want to minimize $C(\pi, V, W)$ over $\pi$ to
within an additive error of $\mu n^2$, and succeed in doing so with probability at least $1 - \delta$, it is enough
to choose a sample E of $O(\mu^{-2}(n \log n + \log \delta^{-1}))$ elements from $\binom{V}{2}$ uniformly at random (with
repetitions), and optimize $C_E(\pi,V,W)$. Assume from now on that $\delta$ is at least $e^{-n}$, so that we
get a more manageable sample bound of $O(\mu^{-2} n \log n)$. Before turning to optimizing $C_E(\pi, V, W)$,
a hard problem in its own right [17, 12], we should first understand whether this bound is at all
good for various scenarios. We need some basic notions of distance between permutations. For two
permutations $\pi, \sigma$, the Kendall-Tau distance $d_\tau(\pi,\sigma)$ is defined as

$$d_\tau(\pi, \sigma) = \sum_{u \ne v} \mathbf{1}[(u \prec_\pi v) \wedge (v \prec_\sigma u)] .$$

The Spearman Footrule distance $d_{\mathrm{foot}}(\pi,\sigma)$ is defined as

$$d_{\mathrm{foot}}(\pi,\sigma) = \sum_{u} |\rho_\pi(u) - \rho_\sigma(u)| .$$

The following is a well known inequality due to Graham and Diaconis [11] relating the two distance
measures for all $\pi, \sigma$:

$$d_\tau(\pi,\sigma) \le d_{\mathrm{foot}}(\pi,\sigma) \le 2 d_\tau(\pi,\sigma) . \tag{2.4}$$

Clearly $d_\tau$ and $d_{\mathrm{foot}}$ are metrics. It is also clear that $C(\cdot,V,\cdot)$ is an extension of $d_\tau(\cdot,\cdot)$ to
distances between permutations and binary tournaments, with the triangle inequality of the form
$d_\tau(\pi, \sigma) \le C(\pi, V, W) + C(\sigma, V, W)$ satisfied for all W and $\pi, \sigma \in \Pi(V)$.
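The two distances and the inequality relating them can be checked numerically on small inputs. The sketch below assumes permutations are given as sequences listing elements from most to least preferred; the function names are illustrative.

```python
from itertools import permutations

def kendall_tau(pi, sigma):
    """Number of ordered pairs (u, v) with u before v in pi and v before u in sigma."""
    pos_p = {v: k for k, v in enumerate(pi)}
    pos_s = {v: k for k, v in enumerate(sigma)}
    return sum(1 for u in pi for v in pi
               if pos_p[u] < pos_p[v] and pos_s[v] < pos_s[u])

def footrule(pi, sigma):
    """Sum over elements of the absolute difference in position."""
    pos_p = {v: k for k, v in enumerate(pi)}
    pos_s = {v: k for k, v in enumerate(sigma)}
    return sum(abs(pos_p[u] - pos_s[u]) for u in pi)


# Exhaustive check of d_tau <= d_foot <= 2 d_tau on all pairs of permutations of 5 elements.
perms = list(permutations(range(5)))
holds = all(kendall_tau(p, s) <= footrule(p, s) <= 2 * kendall_tau(p, s)
            for p in perms for s in perms)
```

This is only a sanity check on five elements, not a proof; it shows how the footrule distance converts a bound on pairwise disagreements into a bound on total displacement.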

Assume now that we are able, using Proposition 2.4 and the ensuing comment, to find a solution
$\pi$ for MFAST, with an additive regret of $O(\mu n^2)$ with respect to an optimal solution $\pi^*$ for some
$\mu > 0$. The triangle inequality implies that the distance $d_\tau(\pi,\pi^*)$ between our solution and the
true optimal is $\Omega(\mu n^2)$. By (2.4), this means that $d_{\mathrm{foot}}(\pi, \pi^*) = \Omega(\mu n^2)$. By the definition of $d_{\mathrm{foot}}$,
this means that the average element $v \in V$ is translated $\Omega(\mu n)$ positions away from its position
in $\pi^*$. In a real life application (e.g. in information retrieval), one may want elements to be at
most a constant $\gamma$ positions away from their position in a correct permutation. This translates
to a sought regret of $O(\gamma n)$ in $C(\pi,V,W)$, or, using the above notation, to $\mu = \gamma/n$. Clearly,
Proposition 2.4 cannot guarantee less than a quadratic sample size for such a regret, which is
tantamount to querying W in its entirety. We can do better: In this work, for any $\varepsilon > 0$ we
will achieve a regret of $O(\varepsilon C(\pi^*, V, W))$ using $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$ queries into W, regardless of how
small the optimal cost $C(\pi^*,V,W)$ is. Hence, our regret is relative to the optimal loss. This is
clearly not achievable using Proposition 2.4.
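To make the sample-size argument concrete, the following worked calculation substitutes $\mu = \gamma/n$ into the sample bound that follows Proposition 2.4. Constants hidden by the O-notation are ignored, and the values $n = 1000$, $\gamma = 10$ are purely illustrative assumptions.

```latex
\begin{align*}
m &= O\!\left(\mu^{-2}\, n \log n\right), \qquad \mu = \gamma / n \\
  &= O\!\left(\frac{n^{3}\log n}{\gamma^{2}}\right). \\
\intertext{For example, with $n = 1000$ and $\gamma = 10$ (natural logarithm, constants dropped):}
\mu^{-2}\, n \ln n &= 10^{4}\cdot 10^{3}\cdot \ln 1000 \approx 6.9\times 10^{7}
  \;\gg\; \binom{1000}{2} = 499{,}500 .
\end{align*}
```

Under these assumptions the uniform-sampling bound asks for far more labels than there are distinct pairs, which is the sense in which it is tantamount to querying W in its entirety.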

Before continuing, we need a slight generalization of Proposition 2.4.

Proposition 2.5. Let $V_1, \dots, V_k$ be an ordered decomposition of V. Let B denote the set of indices
$i \in [k]$ such that $V_i$ is big in V. Assume E is chosen uniformly at random (with repetitions) as
a sample of m elements from $\bigcup_{i \in B} \binom{V_i}{2}$, where $m > n$. For each $i = 1, \dots, k$, let $E_i = E \cap \binom{V_i}{2}$
and $n_i = |V_i|$. Define $C_E(\pi,\{V_1,\dots,V_k\},W)$ to be

$$C_E(\pi,\{V_1,\dots,V_k\},W) = \left(\sum_{i \in B} \binom{n_i}{2}\right) |E|^{-1} \sum_{i \in B} \binom{n_i}{2}^{-1} |E_i|\, C_{E_i}(\pi|_{V_i}, V_i, W|_{V_i}) . \tag{2.5}$$

(The normalization is defined so that the expression is an unbiased estimator of $\sum_{i \in B} C(\pi|_{V_i}, V_i, W|_{V_i})$.
If $|E_i| = 0$ for some i, formally define $\binom{n_i}{2}^{-1} |E_i|\, C_{E_i}(\pi|_{V_i}, V_i, W|_{V_i}) = 0$.) Then with probability at
least $1 - e^{-n}$ over the sample (that is, with $\delta = e^{-n}$ below), all permutations $\pi \in \Pi(V)$ satisfy:

$$\left| C_E(\pi, \{V_1, \dots, V_k\}, W) - \sum_{i \in B} C(\pi|_{V_i}, V_i, W|_{V_i}) \right| = \sum_{i \in B} \binom{n_i}{2} \cdot O\!\left(\sqrt{\frac{n \log m + \log(1/\delta)}{m}}\right) .$$

Proof. Consider the set of binary functions $\prod_{i \in B} \Pi(V_i)$ on the domain $\bigcup_{i \in B} V_i \times V_i$, defined as
follows: If $(u, v) \in V_j \times V_j$ for some $j \in B$, then

$$\left((\pi_i)_{i \in B}\right)(u,v) = \pi_j(u,v) .$$

It is clear that the VC dimension of this function set is at most the sum of the VC dimensions of
$\{\Pi(V_i)\}_{i \in B}$, hence by Lemma 2.3 at most n. The result follows. □
2.4 Using an $\varepsilon$-Good Partition

The following lemma explains why an $\varepsilon$-good partition is good for our purpose.

Lemma 2.6. Fix $\varepsilon > 0$ and assume we have an $\varepsilon$-good partition (Definition 2.1) $V_1, \dots, V_k$ of V.
Let B denote the set of $i \in [k]$ such that $V_i$ is big in V, and let $\bar{B} = [k] \setminus B$. Let $n_i = |V_i|$ for
$i = 1, \dots, k$, and let E denote a random sample of $O(\varepsilon^{-6} n \log n)$ elements from $\bigcup_{i \in B} \binom{V_i}{2}$, each element
chosen uniformly at random with repetitions. Let $E_i$ denote $E \cap \binom{V_i}{2}$. Let $C_E(\pi, \{V_1, \dots, V_k\}, W)$
be defined as in (2.5). For any $\pi \in \Pi(V_1, \dots, V_k)$ define:

$$\tilde{C}(\pi) := C_E(\pi,\{V_1,\dots,V_k\},W) + \sum_{i \in \bar{B}} C(\pi|_{V_i}, V_i, W|_{V_i}) + \sum_{1 \le i < j \le k} \; \sum_{(u,v) \in V_i \times V_j} W(v,u) .$$

Then the following event occurs with probability at least $1 - e^{-n}$: For all $\sigma \in \Pi(V_1, \dots, V_k)$,

$$\left| \tilde{C}(\sigma) - C(\sigma, V, W) \right| \le \varepsilon \min_{\pi \in \Pi(V)} C(\pi, V, W) . \tag{2.6}$$

Also, if $\sigma^*$ is any minimizer of $\tilde{C}(\cdot)$ over $\Pi(V_1, \dots, V_k)$, then

$$C(\sigma^*, V, W) \le (1 + 2\varepsilon) \min_{\pi \in \Pi(V)} C(\pi,V,W) . \tag{2.7}$$

Before we prove the lemma, let us discuss its consequences: Given an $\varepsilon$-good decomposition
$V_1, \dots, V_k$ of V, the lemma implies that if we could optimize $\tilde{C}(\sigma)$ over $\sigma \in \Pi(V_1, \dots, V_k)$, we
would obtain a permutation with a relative regret of $2\varepsilon$ with respect to the optimizer of $C(\cdot, V, W)$
over $\Pi(V)$. Optimizing $\sum_{i \in \bar{B}} C(\sigma|_{V_i}, V_i, W|_{V_i})$ is easy: Each $V_i$ is of size at most $\log n / \log\log n$,
hence exhaustively searching its corresponding permutation space can be done in polynomial time.
In order to compute the cost of each permutation inside the small sets $V_i$, we would need to
query $W|_{V_i}$ in its entirety. This incurs a query cost of at most $\sum_{i \in \bar{B}} \binom{n_i}{2} = O(n \log n / \log\log n)$,
which is dominated by the cost of obtaining the $\varepsilon$-good partition in the first place (see the next
section). Optimizing $C_E(\sigma, \{V_1, \dots, V_k\}, W)$ given E is a tougher nut to crack; it is known as
the minimum feedback arc-set (MFAS) problem and is considered much harder than
MFAST [EICG]. For now we focus on query and not computational complexity, and notice that
the size $|E| = O(\varepsilon^{-6} n \log n)$ of the sample set is all we need. In a later section we show a counterpart of
Lemma 2.6 which provides similar guarantees for practitioners who choose to relax it using SVM,
for which fast solvers exist.

If we assume, in addition, that the decomposition could be computed using $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$
labels (as we indeed show in the next section), then we would clearly beat the aforementioned VC
bound whenever the optimal solution $\min_{\pi \in \Pi(V)} C(\pi, V, W)$ is at most $O(n^{2-\nu})$, for any $\nu > 0$.

Proof. For any permutation $\sigma \in \Pi(V_1, \dots, V_k)$, it is clear that

$$\tilde{C}(\sigma) - C(\sigma, V, W) = C_E(\sigma, \{V_1, \dots, V_k\}, W) - \sum_{i \in B} C(\sigma|_{V_i}, V_i, W|_{V_i}) .$$

By Proposition 2.5, with probability at least $1 - e^{-n}$ the absolute value of the RHS is bounded by
$\varepsilon^3 \sum_{i \in B} \binom{n_i}{2}$, which is at most $\varepsilon \min_{\pi \in \Pi(V)} C(\pi,V,W)$ by (2.2). This establishes (2.6). Inequality
(2.7) is obtained from (??) together with (2.3) and the triangle inequality.

□
3 A Query Efficient Algorithm for $\varepsilon$-Good Decomposing

This section is dedicated to proving the following:

Theorem 3.1. Given a set V of size n, a preference oracle W and an error tolerance parameter
$0 < \varepsilon < 1$, there exists a polynomial time algorithm which returns, with constant probability, an
$\varepsilon$-good partition of V, querying at most $O(\varepsilon^{-6} n \log^5 n)$ locations in W in expectation. The running
time of the algorithm (counting computations) is $O(n\,\mathrm{polylog}(n, \varepsilon^{-1}))$.

Before describing our algorithm, we need some definitions. 

Definition 3.2. Let $\pi$ denote a permutation over V. Let $v \in V$ and $i \in [n]$. We define $\pi_{v \to i}$
to be the permutation obtained by moving the rank of v to i in $\pi$, and leaving the rest of the
elements in the same order. For example, if $V = \{x,y,z\}$ and $(\pi(1), \pi(2), \pi(3)) = (x,y,z)$, then
$(\pi_{x \to 3}(1), \pi_{x \to 3}(2), \pi_{x \to 3}(3)) = (y,z,x)$.
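The single-element move in this definition can be sketched as a short Python helper. The name `move` and the list representation of a permutation (most preferred element first, ranks counted from 1) are assumptions.

```python
def move(perm, v, i):
    """Return a copy of perm in which v has rank i (1-based); others keep their order."""
    rest = [x for x in perm if x != v]
    rest.insert(i - 1, v)
    return rest


# The example from the definition: moving x to rank 3 in (x, y, z).
moved = move(["x", "y", "z"], "x", 3)
```

The helper returns a new list rather than mutating its input, which makes it convenient for comparing the cost before and after a tentative move.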

Definition 3.3. Fix a permutation $\pi$ over V, an element $v \in V$ and an integer $i \in [n]$. We define
the number TestMove$(\pi, V, W, v, i)$ as the decrease in the cost $C(\cdot, V, W)$ achieved by moving from
$\pi$ to $\pi_{v \to i}$. More precisely, TestMove$(\pi, V, W, v, i) = C(\pi,V,W) - C(\pi_{v \to i},V,W)$. Equivalently, if
$i \ge \rho_\pi(v)$ then

$$\mathrm{TestMove}(\pi, V, W, v, i) = \sum_{u:\, \rho_\pi(u) \in [\rho_\pi(v)+1,\, i]} \left( W(u,v) - W(v,u) \right) .$$

A similar expression can be written for $i < \rho_\pi(v)$.
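The equivalence between the cost difference and the local sum in this definition can be checked on a small random tournament. In this sketch the names `cost`, `move` and `move_gain` are hypothetical, W is a dictionary keyed by ordered pairs, and ranks are 1-based.

```python
import random

def cost(perm, W):
    """MFAST cost: sum of W[(v, u)] over pairs where u precedes v in perm."""
    return sum(W[(perm[b], perm[a])]
               for a in range(len(perm)) for b in range(a + 1, len(perm)))

def move(perm, v, i):
    """Copy of perm with v moved to rank i (1-based)."""
    rest = [x for x in perm if x != v]
    rest.insert(i - 1, v)
    return rest

def move_gain(perm, W, v, i):
    """Decrease in cost from moving v to rank i, computed from the local sum only."""
    pos = {u: k + 1 for k, u in enumerate(perm)}   # 1-based ranks
    p = pos[v]
    if i > p:
        between = perm[p:i]                        # ranks p+1 .. i
        return sum(W[(u, v)] - W[(v, u)] for u in between)
    between = perm[i - 1:p - 1]                    # ranks i .. p-1
    return sum(W[(v, u)] - W[(u, v)] for u in between)


# Random tournament on 6 elements (seed fixed only for reproducibility).
rng = random.Random(1)
V = list(range(6))
W = {}
for u in V:
    for v in V:
        if u < v:
            W[(u, v)] = rng.randint(0, 1)
            W[(v, u)] = 1 - W[(u, v)]
agree = all(move_gain(V, W, v, i) == cost(V, W) - cost(move(V, v, i), W)
            for v in V for i in range(1, 7))
```

The local sum only touches the elements lying between the old and new positions of v, which is what makes a sampled approximation of it possible.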

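Now assume that we have a multi-set $E \subseteq \binom{V}{2}$. We define TestMove$_E(\pi, V, W, v, i)$, for $i > \rho_\pi(v)$,
as

$$\mathrm{TestMove}_E(\pi,V,W,v,i) = \frac{|i - \rho_\pi(v)|}{|\tilde{E}|} \sum_{u:\, (u,v) \in \tilde{E}} \left( W(u,v) - W(v,u) \right) ,$$

where the multiset $\tilde{E}$ is defined as $\{(u,v) \in E : \rho_\pi(u) \in [\rho_\pi(v) + 1, i]\}$.
Similarly, for $i < \rho_\pi(v)$ we define

$$\mathrm{TestMove}_E(\pi, V, W, v, i) = \frac{|i - \rho_\pi(v)|}{|\tilde{E}|} \sum_{u:\, (u,v) \in \tilde{E}} \left( W(v, u) - W(u, v) \right) , \tag{3.1}$$

where the multiset $\tilde{E}$ is now defined as $\{(u, v) \in E : \rho_\pi(u) \in [i, \rho_\pi(v) - 1]\}$.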



Lemma 3.4. Fix a permutation $\pi$ over V, an element $v \in V$, an integer $i \in [n]$ and another
integer N. Let $E \subseteq \binom{V}{2}$ be a random (multi)-set of size N with elements $(v,u_1), \dots, (v,u_N)$, drawn
so that for each $j \in [N]$ the element $u_j$ is chosen uniformly at random from among the elements
lying between v (exclusive) and position i (inclusive) in $\pi$. Then $\mathbb{E}[\mathrm{TestMove}_E(\pi, V, W, v, i)] =
\mathrm{TestMove}(\pi, V, W, v, i)$. Additionally, for any $\delta > 0$, except with probability of failure $\delta$,

$$\left| \mathrm{TestMove}_E(\pi, V, W, v, i) - \mathrm{TestMove}(\pi, V, W, v, i) \right| = O\!\left( |i - \rho_\pi(v)| \sqrt{\frac{\log \delta^{-1}}{N}} \right) .$$

The lemma is easily proven using e.g. Hoeffding tail bounds, using the fact that $|W(u,v)| \le 1$ for
all u, v.
3.1 The Decomposition Algorithm

Our decomposition algorithm SampleAndRank is detailed in Algorithm 1, with subroutines in
Algorithms 2 and 3. It can be viewed as a query efficient improvement of the main algorithm in
[18]. Another difference is that we are not interested in an approximation algorithm for MFAST:
Whenever we reach a small block (line 3) or a big block with a probably approximately
sufficiently high cost (line 8) in our recursion of Algorithm 2, we simply output it as a block in our
partition. Denote the resulting outputted partition by $V_1, \dots, V_k$. Denote by $\hat{\pi}$ the minimizer of
$C(\cdot,V,W)$ over $\Pi(V_1, \dots, V_k)$. Most of the analysis is dedicated to showing that $C(\hat{\pi},V,W) \le
(1 + \varepsilon) \min_{\pi \in \Pi(V)} C(\pi, V, W)$, thus establishing (2.3).

In order to achieve an efficient query complexity compared to [18], we use procedure ApproxLocalImprove
(Algorithm 3) to replace a greedy local improvement step in [18] which is not query efficient. Aside
from the aforementioned differences, we also draw the reader's attention to the query efficiency
of QuickSort, which was established by Ailon et al. in [5] (note: an erroneous proof appears in [2]).

SampleAndRank (Algorithm 1) takes the following arguments: The set V we want to rank, the
preference matrix W and an accuracy argument $\varepsilon$. It is implicitly understood that the argument
W passed to SampleAndRank is given as a query oracle, incurring a unit cost upon each access to
a matrix element by the procedure and any nested calls.

The first step in SampleAndRank is to obtain an expected constant factor approximation $\pi$ to
MFAST on V, W, incurring an expected low query cost. More precisely, this step returns a random
permutation $\pi$ with an expected cost of $O(1)$ times that of the optimal solution to MFAST on V, W.
The query complexity of this step is $O(n \log n)$ in expectation [5]. Before continuing, we make the
following assumption, which holds with constant probability by Markov's inequality.

Assumption 3.5. The cost $C(\pi, V, W)$ of the initial permutation $\pi$ computed in the first step of SampleAndRank
is at most $O(1)$ times that of the optimal solution $\pi^*$ to MFAST on $(V, W)$, and the query cost
incurred in the computation is $O(n \log n)$.

Following QuickSort, a recursive procedure SampleAndDecompose is called. It implements a
divide-and-conquer algorithm. Before branching, it executes the following steps. Lines 5 to 9 are
responsible for identifying local chaos, with sufficiently high probability. The following line 10 calls
a procedure ApproxLocalImprove (Algorithm 3) which is responsible for performing query-efficient
approximate greedy steps. We devote the next Sections 3.2-3.4 to describing this procedure. The
establishment of the $\varepsilon$-goodness of SampleAndRank's output (establishing (2.3)) is deferred to
Section 3.5.

3.2 Approximate local improvement steps 

The procedure ApproxLocalImprove takes as input a set V of size N, the preference oracle W,
a permutation $\pi$ on V, two numbers $C_0$, $\varepsilon$ and an integer n. The number n is the size of the
input in the root call to SampleAndDecompose, passed down in the recursion, and used for the
purpose of controlling the success probability of each call to the procedure (there are a total of
$O(n \log n)$ calls, and a union bound will be used to bound a failure probability, hence each call
may fail with probability inversely polynomial in n). The goal of the procedure is to repeatedly
identify, with high probability, single vertex moves that considerably decrease the cost. Note that
in Mathieu et al.'s PTAS [18], a crucial step in their algorithm entails identifying single vertex
moves that decrease the cost by a magnitude which, given our sought query complexity, would not
be detectable. Hence, our algorithm requires altering this crucial part in their algorithm.

The procedure starts by creating a sample ensemble $S = \{E_{v,i} : v \in V, i \in [B, L]\}$, where
$B = \lceil \log \Theta(\varepsilon N / \log n) \rceil$ and $L = \lceil \log N \rceil$. The size of each $E_{v,i} \in S$ is $\Theta(\varepsilon^{-2} \log^2 n)$, and each
element $(v, x) \in E_{v,i}$ was added (with possible multiplicity) by uniformly at random selecting, with
repetitions, an element $x \in V$ positioned at distance at most $2^i$ from the position of $v$ in $\pi$. Let
$\mathcal{D}_\pi$ denote the distribution space from which $S$ was drawn, and let $\Pr_{X \sim \mathcal{D}_\pi}[X = S]$ denote the
probability of obtaining a given sample ensemble $S$.
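The construction of the sample ensemble can be sketched directly from this description. In the sketch below the function name `build_ensemble`, the 0-based positions, and the explicit `samples_per_set` parameter (standing in for the $\Theta(\varepsilon^{-2}\log^2 n)$ sample size) are assumptions of the example.

```python
import random


def build_ensemble(perm, B, L, samples_per_set, rng):
    """Return {(v, i): list of sampled pairs (v, x)} for every v and every i in [B, L].

    perm[r] is the element at (0-based) position r. Each x is drawn uniformly,
    with repetitions, from positions within distance 2**i of v's position,
    clipped to the valid range of positions.
    """
    n = len(perm)
    ensemble = {}
    for r, v in enumerate(perm):
        for i in range(B, L + 1):
            lo, hi = max(0, r - 2 ** i), min(n - 1, r + 2 ** i)
            ensemble[(v, i)] = [
                (v, perm[rng.randint(lo, hi)]) for _ in range(samples_per_set)
            ]
    return ensemble


perm = list("abcdefgh")
S = build_ensemble(perm, B=1, L=3, samples_per_set=5, rng=random.Random(0))
```

Note that building the ensemble queries nothing by itself; a query to $W$ is only needed when a sampled pair is actually evaluated.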

We want S to enable us to approximate the improvement in cost obtained by moving a single 
element u to position j. 

Definition 3.6. Fix $u \in V$ and $j \in [n]$, and assume $\log|j - \rho_\pi(u)| \ge B$. Let $\ell = \lceil \log|j - \rho_\pi(u)| \rceil$.
We say that $S$ is successful at $u, j$ if $|\{x : (u, x) \in E_{u,\ell}\} \cap \{x : \rho_\pi(x) \in [\rho_\pi(u), j]\}| = \Omega(\varepsilon^{-2} \log^2 n)$.

In words, success of $S$ at $u, j$ means that sufficiently many samples $x \in V$ such that $\rho_\pi(x)$ is
between $\rho_\pi(u)$ and $j$ are represented in $E_{u,\ell}$. Conditioned on $S$ being successful at $u, j$, note that
the denominator of $\mathrm{TestMove}_E$ (defined in (3.1)) does not vanish, and we can thereby define:

Definition 3.7. $S$ is a good approximation at $u, j$ if

\[ |\mathrm{TestMove}_{E_{u,\ell}}(\pi, V, W, u, j) - \mathrm{TestMove}(\pi, V, W, u, j)| \le \tfrac{1}{2}\,\varepsilon\,|j - \rho_\pi(u)| / \log n , \]

where $\ell$ is as in Definition 3.6.

In words, $S$ being a good approximation at $u, j$ allows us to approximate a quantity of interest
$\mathrm{TestMove}(\pi, V, W, u, j)$, and to detect whether it is sufficiently large, and more precisely, at least
$\Omega(\varepsilon|j - \rho_\pi(u)| / \log n)$.

Definition 3.8. We say that $S$ is a good approximation if it is successful and a good approximation
at all $u \in V$, $j \in [n]$ satisfying $\lceil \log|j - \rho_\pi(u)| \rceil \in [B, L]$.

Using Chernoff bounds to ensure that $S$ is successful at all $u, j$ as in Definition 3.8, then using Hoeffding
bounds to ensure that $S$ is a good approximation at all such $u, j$, and finally applying a union bound, we get

Lemma 3.9. Except with probability $O(n^{-4})$, $S$ is a good approximation.



Algorithm 1 SampleAndRank($V$, $W$, $\varepsilon$)
1: $n \leftarrow |V|$
2: $\pi \leftarrow$ Expected $O(1)$-approx solution to MFAST using $O(n \log n)$ $W$-queries in expectation, using QuickSort [1]
3: return SampleAndDecompose($V$, $W$, $\varepsilon$, $n$, $\pi$)



3.3 Mutating the Pair Sample To Reflect a Single Element Move 

Algorithm 2 SampleAndDecompose($V$, $W$, $\varepsilon$, $n$, $\pi$)
1: $N \leftarrow |V|$
2: if $N \le \log n / \log\log n$ then
3:   return trivial partition $\{V\}$
4: end if
5: $E \leftarrow$ random subset of $O(\varepsilon^{-4} \log n)$ elements from $\binom{V}{2}$ (with repetitions)
6: $C \leftarrow C_E(\pi, V, W)$   ($C$ is an additive $O(\varepsilon^2 N^2)$ approximation of $C(\pi, V, W)$ w.p. $\ge 1 - n^{-4}$)
7: if $C = \Omega(\varepsilon^2 N^2)$ then
8:   return trivial partition $\{V\}$
9: end if
10: $\pi_1 \leftarrow$ ApproxLocalImprove($V$, $W$, $\pi$, $\varepsilon$, $n$)
11: $k \leftarrow$ random integer in the range $[N/3, 2N/3]$
12: $V_L \leftarrow \{v \in V : \rho_{\pi_1}(v) \le k\}$, $\pi_L \leftarrow$ restriction of $\pi_1$ to $V_L$
13: $V_R \leftarrow V \setminus V_L$, $\pi_R \leftarrow$ restriction of $\pi_1$ to $V_R$
14: return concatenation of decomposition SampleAndDecompose($V_L$, $W$, $\varepsilon$, $n$, $\pi_L$) and decomposition SampleAndDecompose($V_R$, $W$, $\varepsilon$, $n$, $\pi_R$)

Algorithm 3 ApproxLocalImprove($V$, $W$, $\pi$, $\varepsilon$, $n$) (Note: $\pi$ used as both input and output)
1: $N \leftarrow |V|$, $B \leftarrow \lceil \log(\Theta(\varepsilon N / \log n)) \rceil$, $L \leftarrow \lceil \log N \rceil$
2: if $N = O(\varepsilon^{-3} \log^3 n)$ then
3:   return
4: end if
5: for $v \in V$ do
6:   $r \leftarrow \rho_\pi(v)$
7:   for $i = B \ldots L$ do
8:     $E_{v,i} \leftarrow \emptyset$
9:     for $m = 1 .. \Theta(\varepsilon^{-2} \log^2 n)$ do
10:      $j \leftarrow$ integer uniformly at random chosen from $[\max\{1, r - 2^i\}, \min\{n, r + 2^i\}]$
11:      $E_{v,i} \leftarrow E_{v,i} \cup \{(v, \pi(j))\}$
12:    end for
13:  end for
14: end for
15: while $\exists u \in V$ and $j \in [n]$ s.t. (setting $\ell := \lceil \log|j - \rho_\pi(u)| \rceil$): $\ell \in [B, L]$ and $\mathrm{TestMove}_{E_{u,\ell}}(\pi, V, W, u, j) > \varepsilon|j - \rho_\pi(u)| / \log n$ do
16:   for $v \in V$ and $i \in [B, L]$ do
17:     refresh sample $E_{v,i}$ with respect to the move $u \to j$ (see Section 3.3)
18:   end for
19:   $\pi \leftarrow \pi_{u \to j}$
20: end while

Line 17 in ApproxLocalImprove requires elaboration. In lines 15-20 we check whether there exists
an element $u$ and position $j$, such that moving $u$ to $j$ (giving rise to $\pi_{u \to j}$) would considerably
improve the MFAST cost of the procedure input, based on a high probability approximate calculation.
The approximation is done using the sample ensemble $S$. If such an element $u$ exists, we
execute the exchange $\pi \leftarrow \pi_{u \to j}$. With respect to the new value of the permutation $\pi$, the sample
ensemble $S$ becomes stale. By this we mean that if $S$ was a good approximation with respect to
$\pi$, then it is no longer necessarily a good approximation with respect to $\pi_{u \to j}$. We must refresh
it. Before the next iteration of the while loop, we perform in line 17 a transformation $\varphi_{u \to j}$ to $S$,
so that the resulting sample ensemble $\varphi_{u \to j}(S)$ is distributed according to $\mathcal{D}_{\pi_{u \to j}}$. More precisely,
we will define a transformation $\varphi$ such that

\[ \varphi_{u \to j}(\mathcal{D}_\pi) = \mathcal{D}_{\pi_{u \to j}} , \tag{3.2} \]

where the left hand side denotes the distribution obtained by drawing from $\mathcal{D}_\pi$ and applying $\varphi_{u \to j}$
to the result. The transformation $\varphi_{u \to j}$ is performed as follows. Denoting $\varphi_{u \to j}(S) = S' = \{E'_{v,i} :
v \in V, i \in [B, L]\}$, we need to define each $E'_{v,i}$.

Definition 3.10. We say that $E_{v,i}$ is interesting in the context of $\pi$ and $\pi_{u \to j}$ if the two sets $T_1, T_2$
defined as

\[ T_1 = \{x \in V : |\rho_\pi(x) - \rho_\pi(v)| \le 2^i\} \tag{3.3} \]
\[ T_2 = \{x \in V : |\rho_{\pi_{u \to j}}(x) - \rho_{\pi_{u \to j}}(v)| \le 2^i\} \tag{3.4} \]

differ.

We set $E'_{v,i} = E_{v,i}$ for all $v, i$ for which $E_{v,i}$ is not interesting.

Observation 3.11. There are at most $O(|\rho_\pi(u) - j| \log n)$ interesting choices of $v, i$. Additionally,
if $v \ne u$, then for $T_1, T_2$ as in Definition 3.10, $|T_1 \Delta T_2| = O(1)$, where $\Delta$ denotes symmetric
difference.

Fix one interesting choice $v, i$. Let $T_1, T_2$ be as in Definition 3.10. By the last observation, each
of $T_1$ and $T_2$ contains $O(1)$ elements that are not contained in the other. Assume $|T_1| = |T_2|$, let
$X_1 = T_1 \setminus T_2$, and $X_2 = T_2 \setminus T_1$. Fix any injection $\alpha : X_1 \to X_2$, and extend $\alpha : T_1 \to T_2$ so that
$\alpha(x) = x$ for all $x \in T_1 \cap T_2$. Finally, define

\[ E'_{v,i} = \{(v, \alpha(x)) : (v, x) \in E_{v,i}\} . \tag{3.5} \]

(The case $|T_1| \ne |T_2|$ may occur due to the clipping of the ranges $[\rho_\pi(v) - 2^i, \rho_\pi(v) + 2^i]$ and
$[\rho_{\pi_{u \to j}}(v) - 2^i, \rho_{\pi_{u \to j}}(v) + 2^i]$ to a smaller range. This is a simple technicality which may be
taken care of by formally extending the set $V$ by $N$ additional elements $v^L_1, \ldots, v^L_N$, extending
the definition of $\rho_\pi$ for all permutations $\pi$ on $V$ so that $\rho_\pi(v^L_a) = -a + 1$ for all $a$, and similarly
$N = |V|$ additional elements $v^R_1, \ldots, v^R_N$ such that $\rho_\pi(v^R_a) = N + a$. Formally extend $W$ so that
$W(v, v^L_a) = W(v^L_a, v) = W(v, v^R_a) = W(v^R_a, v) = 0$ for all $v \in V$ and $a$. This eliminates the need
for clipping ranges in line 10 in ApproxLocalImprove.)
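The relabelling step (3.5) for a single interesting set can be sketched as follows, assuming $|T_1| = |T_2|$ as in the text. The function name `refresh_set` and the choice of pairing the sorted elements of $X_1$ and $X_2$ are illustrative; any injection would do.

```python
def refresh_set(E_vi, T1, T2):
    """Apply the relabelling of (3.5) to one interesting sample set.

    E_vi is a list of pairs (v, x) with x in T1; the result has x in T2.
    Elements of T1 & T2 are kept, elements of T1 - T2 are mapped into T2 - T1.
    """
    if len(T1) != len(T2):
        raise ValueError("this sketch assumes |T1| == |T2|")
    X1 = sorted(T1 - T2)
    X2 = sorted(T2 - T1)
    alpha = dict(zip(X1, X2))  # an injection X1 -> X2; identity elsewhere
    return [(v, alpha.get(x, x)) for (v, x) in E_vi]


E_old = [(2, 1), (2, 3), (2, 1)]
E_new = refresh_set(E_old, T1={1, 2, 3}, T2={2, 3, 4})
changed = sum(1 for a, b in zip(E_old, E_new) if a != b)
```

Only the pairs counted in `changed` are new, so only those can require fresh queries to $W$; this is the quantity bounded in Section 3.4.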

Finally, for $v = u$ we create $E'_{v,i}$ from scratch by repeating the loop in line 7 for that $v$. It is easy
to see that (3.2) holds. We need, however, something stronger than (3.2). Since our analysis assumes
that $S \sim \mathcal{D}_\pi$ is successful, we must be able to measure the distance (in total variation) between
the random variable $(\mathcal{D}_\pi \mid \text{success})$ defined by the process of drawing from $\mathcal{D}_\pi$ and conditioning on
the result's success, and $\mathcal{D}_{\pi_{u \to j}}$. By Lemma 3.9, the total variation distance between $(\mathcal{D}_\pi \mid \text{success})$
and $\mathcal{D}_{\pi_{u \to j}}$ is $O(n^{-4})$. Using a simple chain rule argument, we conclude the following:

Lemma 3.12. Fix $\pi^0$ on $V$ of size $N$, and fix $u_1, \ldots, u_k \in V$ and $j_1, \ldots, j_k \in [n]$. Consider the
following process. We draw $S^0$ from $\mathcal{D}_{\pi^0}$, and define

\[ S^1 = \varphi_{u_1 \to j_1}(S^0),\ S^2 = \varphi_{u_2 \to j_2}(S^1),\ \ldots,\ S^k = \varphi_{u_k \to j_k}(S^{k-1}) \]
\[ \pi^1 = \pi^0_{u_1 \to j_1},\ \pi^2 = \pi^1_{u_2 \to j_2},\ \ldots,\ \pi^k = \pi^{k-1}_{u_k \to j_k} . \]

Consider the random variable $S^k$ conditioned on $S^0, S^1, \ldots, S^{k-1}$ being successful for $\pi^0, \ldots, \pi^{k-1}$,
respectively. Then the total variation distance between the distribution of $S^k$ and the distribution
$\mathcal{D}_{\pi^k}$ is at most $O(k n^{-4})$.

3.4 Bounding the query complexity of computing $\varphi_{u \to j}(S)$

We now need a notion of distance between $S$ and $S'$, measuring how many extra pairs were introduced
into the new sample family. These pairs may incur the cost of querying $W$. We denote this
measure as $\mathrm{dist}(S, S')$, and define it as $\mathrm{dist}(S, S') := \left| \bigcup_{v,i} E_{v,i} \,\Delta\, E'_{v,i} \right|$.

Lemma 3.13. Assume $S \sim \mathcal{D}_\pi$ for some permutation $\pi$, and $S' = \varphi_{u \to j}(S)$. Then $E[\mathrm{dist}(S, S')] =
O(\varepsilon^{-3} \log^3 n)$.

Proof. Denote $S = \{E_{v,i}\}$ and $S' = \{E'_{v,i}\}$. Fix some $v \ne u$. By construction, the sets $E_{v,i}$
for which $E_{v,i} \ne E'_{v,i}$ must be interesting, and there are at most $O(|\rho_\pi(u) - j| \log n)$ such, using
Observation 3.11. Fix such a choice of $v, i$. By (3.5), $E_{v,i}$ will indeed differ from $E'_{v,i}$ only if it
contains an element $(v, x)$ for some $x \in T_1 \setminus T_2$. But the probability of that is at most

\[ 1 - (1 - O(2^{-i}))^{\Theta(\varepsilon^{-2} \log^2 n)} \le 1 - e^{-\Theta(\varepsilon^{-2} 2^{-i} \log^2 n)} = O(\varepsilon^{-2} 2^{-i} \log^2 n) . \]

(We used the fact that $i \ge B$, where $B$ is as defined in line 1 of ApproxLocalImprove, and $N =
\Omega(\varepsilon^{-3} \log^3 n)$ as guaranteed in line 3 of ApproxLocalImprove.) Therefore, the expected size of
$E'_{v,i} \Delta E_{v,i}$ (counted with multiplicities) is $O(\varepsilon^{-2} 2^{-i} \log^2 n)$.

Now consider all the interesting sets $E_{v_1,i_1}, \ldots, E_{v_P,i_P}$. For each possible value $i$ it is easy to
see that there are at most $2|\rho_\pi(u) - j|$ $p$'s for which $i_p = i$. Therefore,
$E\left[\sum_{p=1}^{P} |E'_{v_p,i_p} \Delta E_{v_p,i_p}|\right] = O\left(\varepsilon^{-2} |\rho_\pi(u) - j| \log^2 n \sum_{i=B}^{L} 2^{-i}\right)$,
where $B, L$ are defined in line 1 in ApproxLocalImprove. Summing
over $i \in [B, L]$, we get at most $O(\varepsilon^{-3} |\rho_\pi(u) - j| \log^3 n / N)$. For $v = u$, the set $\{E_{v,i}\}$ is drawn
from scratch, clearly contributing $O(\varepsilon^{-2} \log^3 n)$ to $\mathrm{dist}(S, S')$. The claim follows. □
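The summation step in the proof can be spelled out as follows. This is a worked version of the same calculation, under the assumption that $|\rho_\pi(u) - j| \le N$ (both positions lie within the current set of size $N$).

```latex
\[
\sum_{i=B}^{L} 2^{-i} \;\le\; 2\cdot 2^{-B}
  \;=\; O\!\left(\frac{\log n}{\varepsilon N}\right)
  \qquad\text{since } 2^{B} = \Theta(\varepsilon N/\log n),
\]
\[
\varepsilon^{-2}\,|\rho_\pi(u)-j|\,\log^{2} n \cdot O\!\left(\frac{\log n}{\varepsilon N}\right)
  \;=\; O\!\left(\frac{\varepsilon^{-3}\,|\rho_\pi(u)-j|\,\log^{3} n}{N}\right)
  \;=\; O\!\left(\varepsilon^{-3}\log^{3} n\right).
\]
```

The contribution of $v = u$, namely $O(\varepsilon^{-2}\log^3 n)$, is of lower order, so the total is $O(\varepsilon^{-3}\log^3 n)$ as claimed.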



3.5 Analysis of SampleAndDecompose 

Throughout the execution of the algorithm, various high probability events must occur in order for
the algorithm's guarantees to hold. Let $S_1, S_2, \ldots$ denote the sample families that arise
through the executions of ApproxLocalImprove, either between lines 5 and 14 or as a mutation
done between lines 15 and 20. We will need the first $\Theta(n^4)$ of them to be good approximations, based on
Definition 3.8. Denote this favorable event $\mathcal{E}_1$. By Lemma 3.12, and using a union bound, this happens with
constant probability (say, 0.99). We also need the cost approximation $C$ obtained in
lines 5-6 of SampleAndDecompose to be successful. Denote this favorable event $\mathcal{E}_2$. By Hoeffding tail bounds, this happens
with probability $1 - O(n^{-4})$ for each execution of the line. This line is executed at most
$O(n \log n)$ times, and hence we can lower bound the probability of success of all executions by 0.99.

From now on, we make the following assumption, which by the above holds with
probability at least 0.97.






Assumption 3.14. Events $\mathcal{E}_1$ and $\mathcal{E}_2$ hold true.



Note that conditioning the remainder of our analysis on this assumption may bias some
expectation upper bounds derived earlier and in what follows. This bias can multiply the estimates
by at most 1/0.97, which can be absorbed in the O-notation of these bounds.

Let $\pi^*$ denote the optimal permutation for the root call to SampleAndDecompose with $V, W, \varepsilon$.
The permutation $\pi$ is, by Assumption 3.5, a constant factor approximation for MFAST on $V, W$.
Using the triangle inequality, we conclude that $d_\tau(\pi, \pi^*) \le C(\pi, V, W) + C(\pi^*, V, W)$. Hence,
$E[d_\tau(\pi, \pi^*)] = O(C(\pi^*, V, W))$. From this we conclude, using (2.4), that

\[ E[d_{foot}(\pi, \pi^*)] = O(C(\pi^*, V, W)) . \]

Now consider the recursion tree $T$ of SampleAndDecompose. Denote by $\mathcal{I}$ the set of internal nodes,
and by $\mathcal{L}$ the set of leaves (i.e. executions exiting from line 8). For a call SampleAndDecompose
corresponding to a node $X$ in the recursion tree, denote the input arguments by $(V_X, W, \varepsilon, n, \pi_X)$.
Let $L[X], R[X]$ denote the left and right children of $X$ respectively. Let $k_X$ denote the integer $k$
in line 11 in the context of $X \in \mathcal{I}$. Hence, by our definitions, $V_{L[X]}$, $V_{R[X]}$, $\pi_{L[X]}$ and $\pi_{R[X]}$ are precisely
$V_L, V_R, \pi_L, \pi_R$ from lines 12-13 in the context of node $X$.

Take, as in line 1, $N_X = |V_X|$. Let $\pi^*_X$ denote the optimal MFAST solution for instance
$(V_X, W|_{V_X})$. By $\mathcal{E}_1$ we conclude that the first $\Theta(n^4)$ times in which we iterate through the while
loop in ApproxLocalImprove (counted over all calls to ApproxLocalImprove), the cost of $\pi_{X, u \to j}$
is an actual improvement compared to $\pi_X$ (for the current value of $\pi_X$, $u$ and $j$ in the iteration), and
the improvement in cost is of magnitude at least $\Omega(\varepsilon|\rho_{\pi_X}(u) - j| / \log n)$, which is $\Omega(\varepsilon^2 N_X / \log^2 n)$
due to the use of $B$ defined in line 1. But this means that the number of iterations of the while
loop in line 15 of ApproxLocalImprove is $O(\varepsilon^{-2} C(\pi_X, V_X, W|_{V_X}) \log^2 n / N_X)$. Indeed, otherwise
the true cost of the running solution would go below 0. Since $C(\pi_X, V_X, W|_{V_X})$ is at most $\binom{N_X}{2}$,
the number of iterations is hence at most $O(\varepsilon^{-2} N_X \log^2 n)$. By Lemma 3.13, the expected query
complexity incurred by the call to ApproxLocalImprove is therefore $O(\varepsilon^{-5} N_X \log^5 n)$. Summing
over the recursion tree, the total query complexity incurred by calls to ApproxLocalImprove is, in
expectation, at most $O(\varepsilon^{-5} n \log^6 n)$.
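The final summation over the recursion tree can be made explicit. The sketch below only uses two facts from the algorithm: nodes at the same depth have disjoint sets $V_X$, and each split point $k$ lies in $[N/3, 2N/3]$, so the depth is $O(\log n)$.

```latex
\[
\sum_{X \text{ at depth } d} N_X \;\le\; n
\quad\text{for every depth } d,
\qquad
\text{depth}(T) \;\le\; \log_{3/2} n \;=\; O(\log n),
\]
\[
\sum_{X \in \mathcal{I}} O\!\left(\varepsilon^{-5} N_X \log^{5} n\right)
\;\le\; O\!\left(\varepsilon^{-5}\log^{5} n\right)\cdot
\sum_{d=0}^{O(\log n)} \;\sum_{X \text{ at depth } d} N_X
\;=\; O\!\left(\varepsilon^{-5}\, n \log^{6} n\right).
\]
```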

Now consider the moment at which the while loop of ApproxLocalImprove terminates. Let $\pi_{1X}$
denote the permutation obtained at that point, returned to SampleAndDecompose in line 10. We
classify the elements $v \in V_X$ into two families: $V_X^{short}$ denotes all $u \in V_X$ s.t. $|\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| =
O(\varepsilon N_X / \log n)$, and $V_X^{long}$ denotes $V_X \setminus V_X^{short}$. We know, by assumption, that the last sample
ensemble $S$ used in ApproxLocalImprove was a good approximation, hence for all $u \in V_X^{long}$,

\[ \mathrm{TestMove}(\pi_{1X}, V_X, W|_{V_X}, u, \rho_{\pi^*_X}(u)) = O(\varepsilon|\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| / \log n) . \tag{3.6} \]

Definition 3.15 (Kenyon and Schudy [18]). For $u \in V_X$, we say that $u$ crosses $k_X$ if the interval
$[\rho_{\pi_{1X}}(u), \rho_{\pi^*_X}(u)]$ contains the integer $k_X$.

Let $V_X^{cross}$ denote the (random) set of elements $u \in V_X$ that cross $k_X$ as chosen in line 11. We
define a key quantity $T_X$ as in [18] as follows:

\[ T_X = \sum_{u \in V_X^{cross}} \mathrm{TestMove}(\pi_{1X}, V_X, W|_{V_X}, u, \rho_{\pi^*_X}(u)) . \tag{3.7} \]

Following (3.6), the elements $u \in V_X^{long}$ can contribute at most

\[ O\left( \varepsilon \sum_{u \in V_X^{long}} |\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| / \log n \right) \]

to $T_X$. Hence the total contribution from such elements is, by definition, $O(\varepsilon\, d_{foot}(\pi_{1X}, \pi^*_X) / \log n)$,
which is, using (2.4), at most $O(\varepsilon\, d_\tau(\pi_{1X}, \pi^*_X) / \log n)$. Using the triangle inequality and the definition
of $\pi^*_X$, the last expression, in turn, is at most $O(\varepsilon\, C(\pi_{1X}, V_X, W|_{V_X}) / \log n)$.

We now bound the contribution of the elements $u \in V_X^{short}$ to $T_X$. The probability of each such
element to cross $k_X$ is $O(|\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| / N_X)$. Hence, the total expected contribution of these
elements to $T_X$ is

\[ O\left( \sum_{u \in V_X^{short}} |\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)|^2 / N_X \right) . \tag{3.8} \]

Under the constraints $\sum_{u \in V_X^{short}} |\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| \le d_{foot}(\pi_{1X}, \pi^*_X)$ and $|\rho_{\pi_{1X}}(u) - \rho_{\pi^*_X}(u)| =
O(\varepsilon N_X / \log n)$, the maximal value of (3.8) is

\[ O(d_{foot}(\pi_{1X}, \pi^*_X)\, \varepsilon N_X / (N_X \log n)) = O(d_{foot}(\pi_{1X}, \pi^*_X)\, \varepsilon / \log n) . \]

Again using (2.4) and the triangle inequality, the last expression is $O(\varepsilon\, C(\pi_{1X}, V_X, W|_{V_X}) / \log n)$.
Combining the accounting for $V_X^{long}$ and $V_X^{short}$, we conclude

\[ E_{k_X}[T_X] = O(\varepsilon\, C(\pi_{1X}, V_X, W|_{V_X}) / \log n) , \tag{3.9} \]

where the expectation is over the choice of $k_X$ in line 11 of SampleAndDecompose.

We are now in a position to use a key lemma from Kenyon et al.'s work [18]. First we need a
definition: Consider the optimal solution $\pi'_X$ respecting $V_{L[X]}, V_{R[X]}$ in lines 12 and 13. By this
we mean that $\pi'_X$ must rank all of the elements in $V_{L[X]}$ before (to the left of) $V_{R[X]}$. For the sake of
brevity, let $C^*_X$ be shorthand for $C(\pi^*_X, V_X, W|_{V_X})$ and $C'_X$ for $C(\pi'_X, V_X, W|_{V_X})$.

Lemma 3.16 (Kenyon and Schudy [18]). With respect to the distribution of the number $k_X$ in
line 11 of SampleAndDecompose,

\[ E[C'_X] \le O\left( \frac{d_{foot}(\pi_{1X}, \pi^*_X)^{3/2}}{N_X} \right) + E[T_X] + C^*_X . \tag{3.10} \]

Using (2.4), we can replace $d_{foot}(\pi_{1X}, \pi^*_X)$ with $d_\tau(\pi_{1X}, \pi^*_X)$ in (3.10). Using the triangle
inequality, we can then, in turn, replace $d_\tau(\pi_{1X}, \pi^*_X)$ with $C(\pi_{1X}, V_X, W|_{V_X})$.



3.6 Summing Over the Recursion Tree 

Let us study the implication of (3.10) for our purpose. Recall that $\{V_1, \ldots, V_k\}$ is the decomposition
returned by SampleAndRank, where each $V_i$ corresponds to a leaf in the recursion tree. Also recall
that $\hat\pi$ denotes the minimizer of $C(\cdot, V, W)$ over all permutations in $\Pi(V_1, \ldots, V_k)$ respecting the
decomposition. Given Assumption 3.14, it suffices, for our purposes, to show that $\hat\pi$ is a (relative)
small approximation for MFAST on $V, W$. Our analysis of this account is basically that of [18],
with slight changes stemming from bounds we derive on $E[T_X]$. We present the proof in full detail
for the sake of completeness. Let $RT$ denote the root node.

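For $X \in \mathcal{I}$, let $\beta_X$ denote the contribution of the split $L[X], R[X]$ to the LHS of (2.3). More
precisely,

\[ \beta_X = \sum_{u \in V_{L[X]},\, v \in V_{R[X]}} \mathbf{1}_{W(v,u)=1} , \]

so we get $\sum_{1 \le i < j \le k} \sum_{(u,v) \in V_i \times V_j} \mathbf{1}_{W(v,u)=1} = \sum_{X \in \mathcal{I}} \beta_X$.

For any $X \in \mathcal{I}$, note also that by our definitions $\beta_X = C'_X - C^*_{L[X]} - C^*_{R[X]}$. Hence, using
Lemma 3.16 and the ensuing comment,

\[ E[\beta_X] \le O\left( E\left[ \frac{C(\pi_{1X}, V_X, W|_{V_X})^{3/2}}{N_X} \right] \right) + E[T_X] + E[C^*_X] - E[C^*_{L[X]}] - E[C^*_{R[X]}] , \]

where the expectations are over the entire space of random decisions made by the algorithm
execution. Summing the last inequality over $X \in \mathcal{I}$, we get (minding the cancellations):

\[ E\left[ \sum_{X \in \mathcal{I}} \beta_X \right] \le O\left( E\left[ \sum_{X \in \mathcal{I}} \frac{C(\pi_{1X}, V_X, W|_{V_X})^{3/2}}{N_X} \right] \right) + E\left[ \sum_{X \in \mathcal{I}} T_X \right] + C^*_{RT} - \sum_{X \in \mathcal{L}} E[C^*_X] . \tag{3.11} \]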



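The expression $E\left[\sum_{X \in \mathcal{I}} T_X\right]$ is bounded by $O\left( E\left[ \sum_{X \in \mathcal{I}} C(\pi_{1X}, V_X, W|_{V_X}) \right] \varepsilon / \log n \right)$ using (3.9) (which
depends on Assumption 3.14). Clearly the sum of $C(\pi_{1X}, V_X, W|_{V_X})$ for $X$ ranging over nodes $X \in \mathcal{I}$ in a
particular level is at most $C(\pi_{RT}, V, W)$ (again using Assumption 3.14 to assert that the cost
of $\pi_{1X}$ is less than the cost of $\pi_X$ at each node $X$). By taking Assumption 3.5 into account,
$C(\pi_{RT}, V, W)$ is $O(C^*_{RT})$. Hence, summing over all $O(\log n)$ levels,

\[ E\left[ \sum_{X \in \mathcal{I}} T_X \right] = O(\varepsilon\, C^*_{RT}) . \tag{3.12} \]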



Let $C_{1X} = C(\pi_{1X}, V_X, W|_{V_X})$ for all $X \in \mathcal{I}$. Denote by $F$ the expression in the O-notation of the
first summand in the RHS of (3.11), more precisely:

\[ F = E\left[ \sum_{X \in \mathcal{I}} \frac{C_{1X}^{3/2}}{N_X} \right] , \tag{3.13} \]

where we remind the reader that $N_X = |V_X|$. It will suffice to show that under Assumption 3.14,
the following inequality holds with probability 1:

\[ G\left( (C_{1X})_{X \in \mathcal{I}}, (N_X)_{X \in \mathcal{I}} \right) := \sum_{X \in \mathcal{I}} \frac{C_{1X}^{3/2}}{N_X} \le c_3\, \varepsilon\, C_{1RT} , \tag{3.14} \]

where $c_3 > 0$ is some global constant. This turns out to require a bit of elementary calculus. A
complete proof of this assertion is not included in [18], which is an extended abstract. We present a
version of the proof here for the sake of completeness.

Under Assumption 3.14, the following two constraints hold uniformly for all $X \in \mathcal{I}$ with
probability 1: Letting $C_X = C(\pi_X, V_X, W|_{V_X})$,

(A1) If $X$ is other than $RT$, let $Y$ be its sibling and $P$ their parent. In case $Y \in \mathcal{I}$,

\[ C_{1X} + C_{1Y} \le C_{1P} . \tag{3.15} \]

(In case $Y \in \mathcal{L}$, we simply have that $C_{1X} \le C_{1P}$.⁴) To see this, notice that $C_{1X} \le C_X$, and
similarly, in case $Y \in \mathcal{I}$, $C_{1Y} \le C_Y$. Clearly $C_X + C_Y \le C_{1P}$, because $\pi_X, \pi_Y$ are simply
restrictions of $\pi_{1P}$ to disjoint blocks of $V_P$. The required inequality (3.15) is proven.

(A2) $C_{1X} \le c_2 \varepsilon^2 N_X^2$ for some global $c_2 > 0$.

In order to show (3.14), we may increase the values $C_{1X}$ for $X \ne RT$ in the following manner:
Start with the root node. If it has no children, there is nothing to do because then $G = 0$. If
it has only one child $X \in \mathcal{I}$, continuously increase $C_{1X}$ until either $C_{1X} = C_{1RT}$ (making (A1)
tight) or $C_{1X} = c_2 \varepsilon^2 N_X^2$ (making (A2) above tight). Then recurse on the subtree rooted by $X$.
In case $RT$ has two children $X, Y \in \mathcal{I}$ (say, $X$ on left), continuously increase $C_{1X}$ until either
$C_{1X} + C_{1Y} = C_{1RT}$ ((A1) tight) or until $C_{1X} = c_2 \varepsilon^2 N_X^2$ ((A2) tight). Then do the same for $C_{1Y}$,
namely, increase it until (A1) is tight or until $C_{1Y} = c_2 \varepsilon^2 N_Y^2$ ((A2) tight). Recursively perform the
same procedure for the subtrees rooted by $X, Y$.

After performing the above procedure, let $\mathcal{I}_1$ denote the set of internal nodes $X$ for which (A1)
is tight, namely, either the sibling $Y$ of $X$ is a leaf and $C_{1X} = C_{1P}$ (where $P$ is $X$'s parent) or
the sibling $Y \in \mathcal{I}$ and $C_{1X} + C_{1Y} = C_{1P}$ (in which case also $Y \in \mathcal{I}_1$). Let $\mathcal{I}_2 = \mathcal{I} \setminus \mathcal{I}_1$. By our
construction, for all $X \in \mathcal{I}_2$, $C_{1X} = c_2 \varepsilon^2 N_X^2$.

Note that if $X \in \mathcal{I}_2$ then its children (more precisely, those in $\mathcal{I}$) cannot be in $\mathcal{I}_1$. Indeed,
this would violate (A2) for at least one child, in virtue of the fact that $N_Y$ lies in the range
$[N_X/3, 2N_X/3]$ for any child $Y$ of $X$. Hence, the set $\mathcal{I}_1 \cup \{RT\}$ forms a connected subtree which
we denote by $T_1$. Let $P \in T_1$ be an internal node in $T_1$. Assume it has one child in $T_1$, call it
$X$. Then $C_{1X} = C_{1P}$ and in virtue of $N_X \le 2N_P/3$ we have $C_{1P}^{3/2}/N_P \le (2/3)\, C_{1X}^{3/2}/N_X$.
Now assume $P$ has two children $X, Y \in T_1$. Then $C_{1X} + C_{1Y} = C_{1P}$. Using elementary calculus,
we also have that $C_{1P}^{3/2}/N_P \le (C_{1X}^{3/2}/N_X + C_{1Y}^{3/2}/N_Y)/\sqrt{2}$ (indeed, the extreme case occurs for
$N_X = N_Y = N_P/2$ and $C_{1X} = C_{1Y} = C_{1P}/2$). We conclude that for any $P$ internal in $T_1$, the
corresponding contribution $C_{1P}^{3/2}/N_P$ to $G$ is geometrically dominated by that of its children in
$\mathcal{I}_1$. Hence the entire sum $G_1 = \sum_{X \in \mathcal{I}_1 \cup \{RT\}} C_{1X}^{3/2}/N_X$ is bounded by $c_4 \sum_{X \in \mathcal{L}_1} C_{1X}^{3/2}/N_X$ for
some constant $c_4$, where $\mathcal{L}_1$ is the set of leaves of $T_1$. For each such leaf $X \in \mathcal{L}_1$, we have that
$C_{1X}^{3/2}/N_X \le c_2^{1/2} \varepsilon\, C_{1X}$ (using (A2)), hence $\sum_{X \in \mathcal{L}_1} C_{1X}^{3/2}/N_X \le \sum_{X \in \mathcal{L}_1} c_2^{1/2} \varepsilon\, C_{1X} \le c_2^{1/2} \varepsilon\, C_{1RT}$ (the
rightmost inequality in the chain follows from $\{V_X\}_{X \in \mathcal{L}_1}$ forming a disjoint cover of $V = V_{RT}$,
together with (A1)). We conclude that $G_1 \le c_4 c_2^{1/2} \varepsilon\, C_{1RT}$.

To conclude (3.14), it remains to bound $G_2 = G - G_1 = \sum_{X \in \mathcal{I}_2} C_{1X}^{3/2}/N_X$. For $P \in \mathcal{I}_2$, clearly
$C_{1P}^{3/2}/N_P = c_2^{3/2} \varepsilon^3 N_P^2$. Hence, if $X, Y \in \mathcal{I}_2$ are children of $P$ in $\mathcal{I}_2$ then $C_{1P}^{3/2}/N_P \ge c_5 (C_{1X}^{3/2}/N_X +
C_{1Y}^{3/2}/N_Y)$, and if $X$ is the unique child of $P$ in $\mathcal{I}_2$, then $C_{1P}^{3/2}/N_P \ge c_5\, C_{1X}^{3/2}/N_X$, for some global
$c_5 > 1$. In other words, the contribution to $G_2$ corresponding to $P$ geometrically dominates the sum
of the corresponding contributions of its children. We conclude that $G_2$ is at most some constant
$c_6$ times $\sum_{X \in \mathrm{root}(\mathcal{I}_2)} C_{1X}^{3/2}/N_X$, where $\mathrm{root}(\mathcal{I}_2)$ is the set of roots of the forest induced by $\mathcal{I}_2$.
As before, it is clear that $\{V_X\}_{X \in \mathrm{root}(\mathcal{I}_2)}$ is a disjoint collection, hence as before we conclude that
$G_2 \le c_7\, \varepsilon\, C_{1RT}$ for some global $c_7 > 0$. The assertion (3.14) follows, and hence (3.13).

⁴ We can say something stronger in this case, but we won't need it here.

Plugging our bounds in (3.11), we conclude that

\[ E\left[ \sum_{X \in \mathcal{I}} \beta_X \right] \le (1 + O(\varepsilon))\, C^*_{RT} - \sum_{X \in \mathcal{L}} E[C^*_X] . \]

Clearly $C(\hat\pi, V, W) = \sum_{X \in \mathcal{I}} \beta_X + \sum_{X \in \mathcal{L}} C^*_X$. Hence $E[C(\hat\pi, V, W)] \le (1 + O(\varepsilon))\, C^*_{RT} = (1 + O(\varepsilon))\, C^*$.
We conclude the desired assertion in expectation.

A simple counting of accesses to $W$ proves Theorem 3.1.



4 Using Our Decomposition as a Preconditioner for SVM 

We consider the following practical scenario, which can be viewed as an improvement over a
version of the well known SVMrank [16, 14] for the preference label scenario.

Consider the setting developed in Section 2.1, where each element $u$ in $V$ is endowed with a
feature vector $\varphi(u) \in \mathbb{R}^d$ for some $d$ (we can also use infinite dimensional spaces via kernels, but
the effective dimension is never more than $n = |V|$). Assume, additionally, that $\|\varphi(u)\|_2 \le 1$ for all
$u \in V$ (otherwise, normalize). Our hypothesis class $\mathcal{H}$ is parametrized by a weight vector $w \in \mathbb{R}^d$,
and each associated permutation $\pi_w$ is obtained by sorting the elements of $V$ in decreasing order
of a score given by $\mathrm{score}_w(u) = \langle \varphi(u), w \rangle$. In other words, $u \prec_{\pi_w} v$ if $\mathrm{score}_w(u) > \mathrm{score}_w(v)$ (in
case of ties, assume any arbitrary tie breaking scheme).
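As a concrete illustration of this hypothesis class, the following sketch ranks elements by their linear score. The function name `rank_by_score`, the dictionary representation of feature vectors, and the tie-breaking rule (by element name) are assumptions of the example.

```python
def rank_by_score(features, w):
    """Return the elements sorted by decreasing score <phi(u), w>.

    features maps each element to its feature vector (a list of floats).
    Ties are broken by the element's string form, an arbitrary choice.
    """
    def score(u):
        return sum(a * b for a, b in zip(features[u], w))

    return sorted(features, key=lambda u: (-score(u), str(u)))


features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
ranking = rank_by_score(features, w=[1.0, 0.0])
```

Here the scores are 1.0, 0.0 and 0.5 for `a`, `b` and `c`, so `a` is ranked first and `b` last.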

The following SVM formulation is a convex relaxation for the problem of optimizing $C(h, V, W)$
over our chosen concept class $\mathcal{H}$:

(SVM1)  minimize  $F_1(w, \xi) = \sum_{u,v} \xi_{u,v}$
        s.t.  $\forall u, v : W(u, v) = 1$:   $\mathrm{score}_w(u) - \mathrm{score}_w(v) \ge 1 - \xi_{u,v}$
              $\forall u, v$:   $\xi_{u,v} \ge 0$
              $\|w\| \le c$

Instead of optimizing (SVM1) directly, we make the following observation. An $\varepsilon$-good decomposition
$V_1, \ldots, V_k$ gives rise to a surrogate learning problem over $\Pi(V_1, \ldots, V_k) \subseteq \Pi(V)$, such that
optimizing over the restricted set does not compromise optimality over $\Pi(V)$ by more than a relative
regret of $\varepsilon$ (property (2.3)). In turn, optimizing over $\Pi(V_1, \ldots, V_k)$ can be done separately for
each block $V_i$. A natural underlying SVM corresponding to this idea is captured as follows:



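(SVM2)  minimize  $F_2(w, \xi) = \sum_{(u,v) \in \Delta_1 \cup \Delta_2} \xi_{u,v}$
        s.t.  $\forall (u, v) \in \Delta_1 \cup \Delta_2$:   $\mathrm{score}_w(u) - \mathrm{score}_w(v) \ge 1 - \xi_{u,v}$
              $\forall u, v$:   $\xi_{u,v} \ge 0$
              $\|w\| \le c$,

where $\Delta_1 = \bigcup_{1 \le i < j \le k} V_i \times V_j$ and $\Delta_2 = \bigcup_{i=1}^{k} \{(u, v) : u, v \in V_i \wedge W(u, v) = 1\}$.

Abusing notation, for $w \in \mathbb{R}^d$ s.t. $\|w\| \le c$, let $F_1(w)$ denote $\min F_1(w, \xi)$, where the minimum
is taken over all $\xi$ that satisfy the constraints of SVM1. Observe that $F_1(w)$ is simply $F_1(w, \xi)$,
where $\xi$ is taken as:

\[ \xi_{u,v} = \begin{cases} \max\{0, 1 - \mathrm{score}_w(u) + \mathrm{score}_w(v)\} & W(u, v) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{4.1} \]

Similarly define $F_2(w)$ as the minimum of $F_2(w, \xi)$ over $\xi$, which is obtained by setting:

\[ \xi_{u,v} = \begin{cases} \max\{0, 1 - \mathrm{score}_w(u) + \mathrm{score}_w(v)\} & (u, v) \in \Delta_1 \cup \Delta_2 \\ 0 & \text{otherwise} \end{cases} \tag{4.2} \]

Let $\pi^*$ denote the optimal solution to MFAST on $V, W$.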
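The closed form for the optimal slacks makes the objective easy to evaluate for a fixed weight vector. The sketch below computes the sum of hinge slacks over a given list of preferred pairs; the names `score` and `F1` and the dictionary representation of features are assumptions of the example, and the norm constraint on `w` is not checked.

```python
def score(features, w, u):
    """Linear score <phi(u), w>."""
    return sum(a * b for a, b in zip(features[u], w))


def F1(w, features, preferred_pairs):
    """Sum of the minimal slacks max{0, 1 - score(u) + score(v)}
    over pairs (u, v) for which u is preferred to v."""
    return sum(
        max(0.0, 1.0 - score(features, w, u) + score(features, w, v))
        for (u, v) in preferred_pairs
    )


features = {"a": [1.0], "b": [0.0]}
agree = F1([2.0], features, [("a", "b")])     # w ranks a above b with margin 2
disagree = F1([2.0], features, [("b", "a")])  # the label contradicts the scores
```

A pair that the scores order correctly with margin at least 1 costs nothing, while a contradicted pair costs 1 plus the score gap.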

We do not know how to directly relate the optimal solution to SVM1 and that of SVM2.
However, we can show that SVM2 can be replaced with a careful sampling of its constraints, such
that (i) the solution to the subsampled SVM is optimal to within a relative error of $\varepsilon$ as a solution
to SVM2, and (ii) the sampling is such that only $O(n\, \mathrm{polylog}(n, \varepsilon^{-1}))$ queries to $W$ are necessary
in order to construct it. This result, which we quantify in what follows, strongly relies on the local
chaos property of the $\varepsilon$-good decomposition (2.2) and some combinatorics on permutations.

Our subsampled SVM, which we denote by SVM3, is obtained as follows. For ease of notation
we assume that all blocks $V_1, \ldots, V_k$ are big in $V$; otherwise a simple accounting of small blocks
needs to be taken care of, adding notational clutter. Let $\Delta_3$ be a subsample of size $M$ (chosen
shortly) of $\Delta_2$, each element chosen uniformly at random from $\Delta_2$ (with repetitions, hence $\Delta_3$ is
a multi-set). Define:

(SVM3)  minimize  $F_3(w, \xi) = \sum_{(u,v) \in \Delta_1} \xi_{u,v} + \frac{\sum_{i=1}^{k} \binom{n_i}{2}}{M} \sum_{(u,v) \in \Delta_3} \xi_{u,v}$
        s.t.  $\forall (u, v) \in \Delta_1 \cup \Delta_3$:   $\mathrm{score}_w(u) - \mathrm{score}_w(v) \ge 1 - \xi_{u,v}$
              $\forall u, v$:   $\xi_{u,v} \ge 0$
              $\|w\| \le c$



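As before, define $F_3(w)$ to be $F_3(w, \xi)$, where $\xi = \xi(w)$ is the minimizer of $F_3(w, \cdot)$ and is taken
as

\[ \xi_{u,v} = \begin{cases} \max\{0, 1 - \mathrm{score}_w(u) + \mathrm{score}_w(v)\} & (u, v) \in \Delta_1 \cup \Delta_3 \\ 0 & \text{otherwise} \end{cases} \tag{4.3} \]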
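The relation between the full and the subsampled objectives can be illustrated with a small sketch. The names `slack`, `F2` and `F3` are hypothetical, the within-block pairs are passed explicitly as a list `delta2`, and the rescaling factor is taken to be `len(delta2) / M`, which is the factor that makes the subsampled sum an unbiased estimate of the full within-block sum.

```python
import random


def slack(w, features, u, v):
    """Minimal slack max{0, 1 - score(u) + score(v)} for the pair (u, v)."""
    s = lambda x: sum(a * b for a, b in zip(features[x], w))
    return max(0.0, 1.0 - s(u) + s(v))


def F2(w, features, delta1, delta2):
    """Full objective: slacks over all between-block and within-block pairs."""
    return sum(slack(w, features, u, v) for (u, v) in delta1 + delta2)


def F3(w, features, delta1, delta2, M, rng):
    """Subsampled objective: within-block pairs are sampled M times with
    repetitions and their slack sum is rescaled by len(delta2) / M."""
    sample = [delta2[rng.randrange(len(delta2))] for _ in range(M)]
    scale = len(delta2) / M
    between = sum(slack(w, features, u, v) for (u, v) in delta1)
    within = sum(slack(w, features, u, v) for (u, v) in sample)
    return between + scale * within


# Two blocks {a, b} and {c, d}; delta1 holds the pairs across blocks.
features = {"a": [1.0], "b": [0.5], "c": [0.2], "d": [0.0]}
delta1 = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
delta2 = [("a", "b"), ("d", "c")]
w0 = [0.0]  # with w = 0 every slack equals 1, so F3 must equal F2 exactly
f2 = F2(w0, features, delta1, delta2)
f3 = F3(w0, features, delta1, delta2, M=5, rng=random.Random(0))
```

For general `w` the two values differ by a random amount whose size is what Theorem 4.2 controls; only the sampled within-block pairs need their labels queried.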

Our ultimate goal is to show that for quite small M, SVM3 is a good approximation of SVM2. 
To that end we first need another lemma. 

Lemma 4.1. Any feasible solution $(w, \xi)$ for SVM1 satisfies $\sum_{u,v} \xi_{u,v} \ge C(\pi^*, V, W)$.

Proof. The following has been proven in [1]: Consider non-transitive triangles induced by $W$: These
are triplets $(u, v, y)$ of elements in $V$ such that $W(u, v) = W(v, y) = W(y, u) = 1$. Note that any
permutation must disagree with at least one pair of elements contained in a non-transitive triangle.
Let $T$ denote the set of non-transitive triangles. Now consider an assignment of non-negative
weights $\beta_t$ for each $t \in T$. We say that the weight system $\{\beta_t\}_{t \in T}$ packs $T$ if for all $u, v \in V$ such
that $W(u, v) = 1$, the sum $\sum_{t : (u,v) \text{ in } t} \beta_t$ is at most 1. (By $(u, v)$ in $t$ we mean that $u, v$ are two of the
three elements inducing $t$.) Let $\{\beta^*_t\}_{t \in T}$ be a weight system packing $T$ with the maximum possible
value of the sum of weights. Then

\[ \sum_{t \in T} \beta^*_t \ge C(\pi^*, V, W)/3 . \tag{4.4} \]

Now consider one non-transitive triangle $t = (u, v, y) \in T$. We lower bound $\xi_{u,v} + \xi_{v,y} + \xi_{y,u}$
for any $\xi$ such that $w, \xi$ is a feasible solution to SVM1. Letting $a = \mathrm{score}_w(u) - \mathrm{score}_w(v)$, $b =
\mathrm{score}_w(v) - \mathrm{score}_w(y)$, $c = \mathrm{score}_w(y) - \mathrm{score}_w(u)$, we get from the constraints in SVM1 that $\xi_{u,v} \ge
1 - a$, $\xi_{v,y} \ge 1 - b$, $\xi_{y,u} \ge 1 - c$. But clearly $a + b + c = 0$, hence

\[ \xi_{u,v} + \xi_{v,y} + \xi_{y,u} \ge 3 . \tag{4.5} \]

Now notice that the objective function of SVM1 can be bounded from below as follows:

\[ \sum_{u,v} \xi_{u,v} \ge \sum_{t = (u,v,y) \in T} \beta^*_t (\xi_{u,v} + \xi_{v,y} + \xi_{y,u}) \ge \sum_{t = (u,v,y) \in T} \beta^*_t \cdot 3 \ge C(\pi^*, V, W) . \]

(The first inequality was due to the fact that $\{\beta^*_t\}_{t \in T}$ is a packing of the non-transitive triangles,
hence the total weight corresponding to each pair $u, v$ is at most 1. The second inequality is from
(4.5) and the third is from (4.4).) This concludes the proof. □
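The key step (4.5) is a one-line calculation, shown here together with a small numeric instance. The scores 2, 1, 0 in the example are chosen for illustration only.

```latex
\[
\xi_{u,v} + \xi_{v,y} + \xi_{y,u}
  \;\ge\; (1-a) + (1-b) + (1-c)
  \;=\; 3 - (a+b+c) \;=\; 3 .
\]
% Example: score_w(u) = 2, score_w(v) = 1, score_w(y) = 0 gives
% a = 1, b = 1, c = -2, so the constraints (with nonnegativity) force
\[
\xi_{u,v} \ge 0, \qquad \xi_{v,y} \ge 0, \qquad \xi_{y,u} \ge 3 .
\]
```

In the example the scores satisfy two of the three cyclic preferences with margin 1, and the entire slack of 3 is paid on the third pair.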

Theorem 4.2. Let $\varepsilon \in (0, 1)$ and $M = O(\varepsilon^{-6} (1 + 2c)^2 d \log(1/\varepsilon))$. Then with high constant
probability, for all $w$ such that $\|w\| \le c$,

\[ |F_3(w) - F_2(w)| = O(\varepsilon F_2(w)) . \]

Proof. Let $B_d(c) = \{z \in \mathbb{R}^d : \|z\| \le c\}$. Fix a vector $w \in B_d(c)$. Over the random choice of $\Delta_3$,
it is clear that $E[F_3(w)] = F_2(w)$. We need a strong concentration bound. From the observation
that $|\xi_{u,v}| \le 1 + 2c$ for all $u, v$, we conclude (using the Hoeffding bound) that for all $\mu > 0$,

\[ \Pr[|F_3(w) - F_2(w)| \ge \mu] \le 2 \exp\left\{ - \frac{2 \mu^2 M}{\left( \sum_{i=1}^{k} \binom{n_i}{2} \right)^2 (1 + 2c)^2} \right\} . \tag{4.6} \]

Let $\eta = \varepsilon^3$ and consider an $\eta$-net of vectors in the ball $B_d(c)$. By this we mean a subset $\Gamma \subseteq B_d(c)$
such that for all $z \in B_d(c)$ there exists $w \in \Gamma$ s.t. $\|z - w\| \le \eta$. Standard volumetric arguments
imply that there exists such a set $\Gamma$ of cardinality at most $(c/\eta)^d$.

Let $z \in \Gamma$ and $w \in B_d(c)$ such that $\|w - z\| \le \eta$. From the definition of $F_2, F_3$, it is clear that

\[ |F_2(w) - F_2(z)| \le \sum_{i=1}^{k} \binom{n_i}{2} \eta , \qquad |F_3(w) - F_3(z)| \le \sum_{i=1}^{k} \binom{n_i}{2} \eta . \tag{4.7} \]

Using (4.6), we conclude that for any $\mu > 0$, by taking $M = O\left( \mu^{-2} \left( \sum_{i=1}^{k} \binom{n_i}{2} \right)^2 (1 + 2c)^2 d \log(c \eta^{-1}) \right)$,
with constant probability over the choice of $\Delta_3$, uniformly for all $z \in \Gamma$:

\[ |F_3(z) - F_2(z)| \le \mu . \]

Take $\mu = \varepsilon^3 \sum_{i=1}^{k} \binom{n_i}{2}$. We conclude (plugging in our choice of $\mu$ and the definition of $\eta$) that by
choosing

\[ M = O(\varepsilon^{-6} (1 + 2c)^2 d \log(c/\varepsilon)) , \]

with constant probability, uniformly for all $z \in \Gamma$:

\[ |F_3(z) - F_2(z)| \le \varepsilon^3 \sum_{i=1}^{k} \binom{n_i}{2} . \]

Using (4.7) and the triangle inequality, we conclude that for all $w \in B_d(c)$,

\[ |F_3(w) - F_2(w)| \le 3 \varepsilon^3 \sum_{i=1}^{k} \binom{n_i}{2} . \tag{4.9} \]

By property (2.2) of the $\varepsilon$-goodness definition, (4.9) implies

\[ |F_3(w) - F_2(w)| \le 3 \varepsilon \sum_{i=1}^{k} \min_{\sigma \in \Pi(V_i)} C(\sigma, V_i, W|_{V_i}) . \]

By Lemma 4.1 applied separately in each block $V_i$, this implies

\[ |F_3(w) - F_2(w)| \le 3 \varepsilon \sum_{i=1}^{k} \sum_{u,v \in V_i} \xi_{u,v} \le 3 \varepsilon F_2(w) , \]

(where $\xi = \xi(w)$ is as defined in (4.2).) This concludes the proof.



□

A Linear VC Bound of Permutation Set

To see why the VC dimension of the set of permutations, viewed as binary functions over the set of
all possible $\binom{n}{2}$ preferences, is less than $n$, it is enough to show that any collection of $n$ pairs of elements cannot be
shattered by the set of permutations. (Refer to [19] for a definition of
shattering.) Indeed, any such collection must contain a cycle, and the set of permutations cannot
direct a cycle cyclically.
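The argument can be checked by brute force on a small instance. The sketch below enumerates all permutations of four elements and tests whether a given set of pairs is shattered; the function name `shattered` and the choice $n = 4$ are assumptions of the example.

```python
from itertools import combinations, permutations


def shattered(pairs, n):
    """True if the permutations of range(n) realize all 2**len(pairs)
    orientations of the given pairs."""
    patterns = set()
    for perm in permutations(range(n)):
        pos = {v: r for r, v in enumerate(perm)}
        # Record, for each pair (u, v), whether u is ranked before v.
        patterns.add(tuple(pos[u] < pos[v] for (u, v) in pairs))
    return len(patterns) == 2 ** len(pairs)


n = 4
all_pairs = list(combinations(range(n), 2))
# Any n pairs on n elements contain a cycle, so none should be shattered.
any_n_shattered = any(shattered(c, n) for c in combinations(all_pairs, n))
# A path of n - 1 pairs has no cycle and can be oriented arbitrarily.
path_shattered = shattered([(0, 1), (1, 2), (2, 3)], n)
```

The check confirms both directions for this instance: no set of $n$ pairs is shattered, while a cycle-free set of $n - 1$ pairs is.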